Strip punctuation from text.


Published in: PHP

When processing text for a search engine or analysis tool, your code needs to strip out punctuation, formatting, spacing, and control characters so that only indexable text remains. International text contains hundreds of such characters, and some should be removed in one context but kept in another: a period ends a sentence, for example, but it also separates the digits of a number and the parts of a URL. This tip shows how.
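
As a rough illustration of the "hundreds of characters" point, the sketch below relies on PCRE's Unicode property classes rather than listing characters one by one. The sample string and variable names are assumptions made up for this example, and the `"\u{...}"` escapes assume PHP 7 or later.

```php
<?php
// Illustrative only: the sample text and variable names are assumptions.
// U+201C/U+201D are curly quotes, U+00A0 is a no-break space,
// U+3000 is an ideographic space.
$text = "\u{201C}Hello\u{201D}\u{00A0}world\u{3000}again";

// \p{Z} matches any Unicode separator; \p{Pi} and \p{Pf} match opening
// and closing quotation marks. The /u flag makes PCRE treat the subject
// as UTF-8 so that these classes work on whole characters.
$clean = preg_replace('/[\p{Z}\p{Pi}\p{Pf}]+/u', ' ', $text);

echo trim($clean), "\n";   // prints "Hello world again"
```

A single character class covers every separator and quote mark Unicode defines, which is why the approach scales to international text.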


function strip_punctuation( $text )
{
    // Characters that can appear inside URLs, grouped by where they
    // are allowed relative to a space.
    $urlbrackets    = '\[\]\(\)';
    $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
    $urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
    $urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;

    $specialquotes  = '\'"\*<>';

    // Number separators: full stops, commas, and Arabic separators.
    $fullstop       = '\x{002E}\x{FE52}\x{FF0E}';
    $comma          = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep        = '\x{066B}\x{066C}';
    $numseparators  = $fullstop . $comma . $arabsep;

    // Number modifiers: number signs, percent-like signs, and primes.
    $numbersign     = '\x{0023}\x{FE5F}\x{FF03}';
    $percent        = '\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $prime          = '\x{2032}\x{2033}\x{2034}\x{2057}';
    $nummodifiers   = $numbersign . $percent . $prime;

    // The patterns are applied in order; each match becomes one space.
    return preg_replace(
        array(
            // Remove separator, control, formatting, surrogate,
            // open/close quotes.
            '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u',
            // Remove other punctuation except special cases
            '/\p{Po}(?<![' . $specialquotes .
                $numseparators . $urlall . $nummodifiers . '])/u',
            // Remove open/close brackets, except URL brackets.
            '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
            // Remove special quotes, dashes, connectors, number
            // separators, and URL characters followed by a space
            '/[' . $specialquotes . $numseparators . $urlspaceafter .
                '\p{Pd}\p{Pc}]+((?= )|$)/u',
            // Remove special quotes, connectors, and URL characters
            // preceded by a space
            '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u',
            // Remove dashes preceded by a space, but not followed by a number
            '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
            // Remove consecutive spaces
            '/ +/',
        ),
        ' ',
        $text );
}
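
To see the context-dependence in isolation, here is a minimal sketch that applies only the dash rule and the space-collapsing rule from the function above. The helper name `strip_loose_dashes` and the sample sentence are hypothetical, introduced only for this illustration.

```php
<?php
// Hypothetical helper for illustration; not part of the original tip.
function strip_loose_dashes( $text )
{
    return preg_replace(
        array(
            // A dash at the start of a word (after a space or at the start
            // of the string) is removed unless a digit (\p{N}) or a
            // currency symbol (\p{Sc}) follows it.
            '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
            // Collapse the runs of spaces left behind.
            '/ +/',
        ),
        ' ',
        $text );
}

echo strip_loose_dashes( 'cost -5 or -$5, well-known -fact' ), "\n";
// prints "cost -5 or -$5, well-known fact"
```

The minus signs in front of `5` and `$5` survive because they carry numeric meaning, the hyphen in `well-known` survives because it is not preceded by a space, and only the stray dash before `fact` is dropped.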

URL: http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page#Removinglineparagraphandwordseparators
