Posted By

localhorst on 03/11/08


Tagged

regex strip remove clean regexpr punctuation


Strip punctuation from text.


Published in: PHP

URL: http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page#Removinglineparagraphandwordseparators

When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal the indexable text. International text contains hundreds of these characters, and some should be removed in one context but kept in another. This tip shows how.
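
To make the "one context, but not another" point concrete, here is a minimal sketch (not part of the original snippet) that treats a full stop or comma as removable only when it is followed by a space or ends the text, so decimal numbers survive. The helper name `strip_trailing_separators` is hypothetical, and it handles only ASCII `.` and `,`.

```php
<?php
// Hypothetical helper for illustration only; it is not the snippet's function.
function strip_trailing_separators($text)
{
    $text = preg_replace(
        array(
            // Remove runs of '.' or ',' only when a space or the end of the
            // text follows, so the '.' inside "3.14" is left alone.
            '/[.,]+(?= |$)/u',
            // Collapse the runs of spaces left behind by the replacement.
            '/ +/',
        ),
        ' ',
        $text
    );
    return trim($text);
}

echo strip_trailing_separators('Pi is 3.14, roughly.'), "\n"; // Pi is 3.14 roughly
```

The same character is kept or dropped purely on the basis of its neighbours, which is the idea the full function applies to a much larger set of characters.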

    /**
     * Strip punctuation, separators, and control characters from UTF-8 text,
     * keeping characters that are meaningful inside numbers and URLs.
     */
    function strip_punctuation( $text )
    {
        // Characters that may legitimately appear inside URLs (pre-escaped
        // for use inside a regex character class).
        $urlbrackets = '\[\]\(\)';
        $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
        $urlspaceafter = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
        $urlall = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;

        $specialquotes = '\'"\*<>';

        // Characters used as separators within numbers.
        $fullstop = '\x{002E}\x{FE52}\x{FF0E}';
        $comma = '\x{002C}\x{FE50}\x{FF0C}';
        $arabsep = '\x{066B}\x{066C}';
        $numseparators = $fullstop . $comma . $arabsep;

        // Characters that modify numbers (number signs, percent, primes).
        $numbersign = '\x{0023}\x{FE5F}\x{FF03}';
        $percent = '\x{066A}\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
        $prime = '\x{2032}\x{2033}\x{2034}\x{2057}';
        $nummodifiers = $numbersign . $percent . $prime;

        // Every pattern below is replaced by a single space.
        return preg_replace(
            array(
                // Remove separator, control, formatting, surrogate,
                // open/close quotes.
                '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u',
                // Remove other punctuation except special cases
                '/\p{Po}(?<![' . $specialquotes .
                    $numseparators . $urlall . $nummodifiers . '])/u',
                // Remove open/close brackets, except URL brackets.
                '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
                // Remove special quotes, dashes, connectors, number
                // separators, and URL characters followed by a space
                // or at the end of the text
                '/[' . $specialquotes . $numseparators . $urlspaceafter .
                    '\p{Pd}\p{Pc}]+((?= )|$)/u',
                // Remove special quotes, connectors, and URL characters
                // preceded by a space or at the start of the text
                '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u',
                // Remove dashes preceded by a space, but not followed by
                // a number or currency symbol
                '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
                // Remove consecutive spaces
                '/ +/',
            ),
            ' ',
            $text );
    }
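
The least obvious idiom in the function is the pattern `\p{Po}(?<![...])`: it matches any character in the Unicode "other punctuation" category, then uses a negative lookbehind to reject the match if the character just consumed is on a keep-list. Below is a reduced sketch of that idiom in isolation. The helper `strip_other_punctuation` and its `$keep` parameter are hypothetical names introduced here for illustration, and the sketch assumes `$keep` is non-empty and the input is valid UTF-8.

```php
<?php
// Hypothetical reduced helper; not part of the original snippet.
// Assumes $keep is a non-empty string of characters to preserve.
function strip_other_punctuation($text, $keep = '.,')
{
    // Escape the keep-list so it is safe to embed in the pattern.
    $keepClass = preg_quote($keep, '/');

    $text = preg_replace(
        array(
            // Match one "other punctuation" character, then look back at
            // the character just matched and fail if it is on the keep-list.
            '/\p{Po}(?<![' . $keepClass . '])/u',
            // Collapse the runs of spaces left behind.
            '/ +/',
        ),
        ' ',
        $text
    );
    return trim($text);
}

echo strip_other_punctuation('Wait! 3.5, or 4?'), "\n"; // Wait 3.5, or 4
```

Because `\p{Po}` is a Unicode category rather than a hand-written list, the same pattern also catches non-ASCII marks such as the inverted question mark, which is why the full function can cope with international text without enumerating every character.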
