Posted By

xyzeugene on 03/04/19


Fast Perl shell script to remove stopwords from a text corpus


Published in: Perl
 

URL: https://corpocrat.com/2015/10/12/fast-perl-shell-script-to-remove-stopwords-from-text-corpus/

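If you want to remove every word listed in a stopword file (one word per line) from a text file, run this Perl script from the command line. The stopword file is the first argument and the text file is the second; the cleaned text is written to standard output.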

You can use it like this:

./remove.pl stopwords.txt data.txt > data.cleaned

#!/usr/bin/env perl
# usage: remove.pl words text > newfile
use strict;
use warnings;   # replaces "-w", which many systems reject after "env" in a shebang line

# poor man's argument handler: stopword file first, then the text to clean
open(my $words_fh, '<', shift @ARGV) || die "failed to open words file: $!";
open(my $text_fh, '<', shift @ARGV) || die "failed to open text file: $!";

# get all words into an array
my @words;
while (my $line = <$words_fh>) {
    push @words, split ' ', $line;   # break up words on line; this also drops the newline
}
close $words_fh;

# (optional)
# sort by length so longer words are handled before shorter ones; ie, "then" before "the"
@words = sort { length($b) <=> length($a) } @words;

# slurp text file into one variable
my $text = do { local $/; <$text_fh> };
close $text_fh;

# now for each word, do a global search-and-replace; \b makes sure only whole
# words are replaced; also remove one following whitespace character, if any
foreach my $word (@words) {
    $text =~ s/\b\Q$word\E\b\s?//g;
}

# output "fixed" text
print $text;
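
The loop above rescans the whole text once per stopword, which can get slow with a long stopword list. One possible variation, not part of the original script, is to join all stopwords into a single alternation and scan the text once. The sketch below is a minimal illustration under that assumption. The helper names build_stopword_regex and remove_stopwords are hypothetical, and the stopwords are passed in as a list instead of being read from a file. For ordinary whitespace-separated text it should give the same result as the loop, though I have not benchmarked it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): build ONE regex that
# matches any stopword, so the text only has to be scanned a single time.
sub build_stopword_regex {
    my @words = @_;
    # Longest first, so the alternation tries "then" before "the".
    my $alternation = join '|',
        map { quotemeta($_) } sort { length($b) <=> length($a) } @words;
    return qr/\b(?:$alternation)\b\s?/;
}

# Hypothetical helper: delete every stopword plus one following
# whitespace character, mirroring the substitution in the original loop.
sub remove_stopwords {
    my ($text, @words) = @_;
    # Guard: an empty alternation would match at every word boundary.
    return $text unless @words;
    my $re = build_stopword_regex(@words);
    $text =~ s/$re//g;
    return $text;
}

# prints "cat sat mat left"
print remove_stopwords("the cat sat on the mat then left\n", qw(the on then));
```

To use this with the files from the command above, read stopwords.txt into the word list and slurp data.txt into the text, exactly as the original script does, then print the return value of remove_stopwords.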
