Posted By

noah on 11/04/08


Tagged

fix regex html strip text translate Microsoft word MS office



Clean up Word documents that have been translated to HTML


Published in: Perl
 

Haven't tried this with any recent versions of Word. Yet.

    #!/usr/local/bin/perl -w
    use strict;

    #############################################################
    # NOAH SUSSMAN
    #
    # clean up word
    #
    # Created 5/16/01 at 02:33 PM
    #
    # Clean up Word documents that have been translated to HTML
    #############################################################

    # Hard-coded input file (a classic Mac OS path). Remove this line to
    # pass file names on the command line instead.
    $ARGV[0] = "Macintosh HD:NOAH:2001:05-MAY 2001:3-May 15-21:3-Revisions to Corp Site:large number of Word docs:1.2 Services.html";

    $^I = ".bk";    # edit the file in place, keeping a backup with a .bk suffix

    undef $/;       # slurp the whole file into $_

    while (<>) {

        # Destroy all tags except A, B, IMG, CENTER, P, UL, OL, LI, TABLE,
        # TD, TR, HTML, BODY, HEAD and TITLE. The \b keeps "b" from also
        # sparing <br> and "p" from also sparing <pre>.
        s{<(?!/?(?:a|b|img|center|p|ul|ol|li|table|td|tr|html|body|head|title)\b)[^>]*>\s*}{}gi;

        # Fix mis-nested tags, if any: <x>..<y>..</x>..</y> becomes
        # <y>..<x>..</x>..</y>. \1 and \3 are backreferences to the two tag
        # names; only plain text may sit between the four tags.
        s{<(\w+)>([^<]*)<((?!\1\b)\w+)>([^<]*)</\1>([^<]*)</\3>}{<$3>$2<$1>$4</$1>$5</$3>}gi;

        print;

    }
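
To try the two substitutions without editing a file in place, they can be wrapped in subs and run on a string. This is only a rough sketch: the sub names `strip_tags` and `fix_nesting` and the sample markup are made up for illustration. The patterns here assume the intent was to match whole tag names (hence the `\b`) and to repair only tags that sit right next to each other with plain text in between.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every tag whose name is not on the keep-list.
# The \b after the list makes "b" match <b> but not <br>.
sub strip_tags {
    my ($html) = @_;
    $html =~ s{<(?!/?(?:a|b|img|center|p|ul|ol|li|table|td|tr|html|body|head|title)\b)[^>]*>\s*}{}gi;
    return $html;
}

# Hypothetical helper: turn <x>..<y>..</x>..</y> into <y>..<x>..</x>..</y>.
# \1 and \3 refer back to the first and second tag names.
sub fix_nesting {
    my ($html) = @_;
    $html =~ s{<(\w+)>([^<]*)<((?!\1\b)\w+)>([^<]*)</\1>([^<]*)</\3>}{<$3>$2<$1>$4</$1>$5</$3>}gi;
    return $html;
}

# Made-up sample of the kind of markup Word tends to emit.
my $sample = '<p class=MsoNormal><span lang=EN-US>Our <b>services</b><o:p></o:p></span></p><br>';

print strip_tags($sample), "\n";
# prints: <p class=MsoNormal>Our <b>services</b></p>

print fix_nesting('<b>bold <center>both</b> centered</center>'), "\n";
# prints: <center>bold <b>both</b> centered</center>
```

Note that `fix_nesting` only sees tags without attributes, since `<(\w+)>` will not match something like `<a href="...">`.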
