Posted By

noah on 07/03/07


Tagged

regex list file text filter emacs write create work useful tools automation productivity jag 2005 essential forqblog



Remove duplicate lines from a text file with Perl


Published in: Perl
 

URL: http://answers.google.com/answers/threadview?id=25196

Found at Google Answers.

Sometimes I get a big list of things, and some of the things occur multiple times in the same list. To make the list easier to read, I want to delete the duplicate lines.

A good example is a list of files that have errors (maybe excerpted from an application server's log files). In that case you have a newline-delimited list of file paths, and depending on the situation, the same file path might be listed 4 or 5 times or more. Often it is useful to have a list of just the faulty files, each named once, which you can produce by deleting all the duplicate lines. This script is for filtering exactly that kind of list file. It keeps the first occurrence of each line, preserves the original order, drops blank lines, and ignores leading and trailing whitespace when comparing lines.

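Of course, for Emacs users there is a much easier way to remove duplicate lines, if you have uniq installed on your system. Select the whole buffer and sort it, then select it again and pipe it through uniq. The C-u prefix makes Emacs replace the buffer contents with the output instead of showing it in a separate buffer. Note that, unlike the script, this reorders the lines.

C-x h M-x sort-lines RET C-x h C-u M-x shell-command-on-region RET uniq RET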

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $origfile = shift or die "Usage: $0 FILE\n";
# Write the output next to the input file, prefixed with "no_dupes_".
my ($name, $dir) = fileparse($origfile);
my $outfile = $dir . "no_dupes_" . $name;
my %seen;

open(my $in,  '<', $origfile) or die "Couldn't open input file: $!";
open(my $out, '>', $outfile)  or die "Couldn't open output file: $!";

while (my $line = <$in>) {
    next if $line =~ m/^\s*$/;   # skip empty and whitespace-only lines
    $line =~ s/^\s+//;           # strip leading whitespace
    $line =~ s/\s+$//;           # strip trailing whitespace, including the newline
    # Print a line only the first time it is seen.
    print {$out} "$line\n" unless $seen{$line}++;
}
close $out or die "Couldn't close output file: $!";
close $in;
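
The script hinges on one idiom: a hash whose keys are the lines already seen, with a post-increment that is false only the first time a key appears. As a minimal sketch of that idiom on an in-memory list, the following uses a hypothetical helper, `dedupe_lines`, and made-up sample paths; neither is part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the unique
# lines of a list in first-seen order, ignoring blank lines and
# leading/trailing whitespace.
sub dedupe_lines {
    my @lines = @_;
    my (%seen, @unique);
    for my $line (@lines) {
        next if $line =~ /^\s*$/;    # skip blank lines
        $line =~ s/^\s+|\s+$//g;     # trim both ends, including the newline
        # $seen{$line}++ is 0 (false) on first sight, true afterwards.
        push @unique, $line unless $seen{$line}++;
    }
    return @unique;
}

# Made-up sample data standing in for paths excerpted from a log file.
my @paths = (
    "/var/app/a.jsp\n",
    "/var/app/b.jsp\n",
    "/var/app/a.jsp\n",
    "\n",
    "/var/app/a.jsp  \n",
);

print "$_\n" for dedupe_lines(@paths);
```

With this sample data the sketch prints `/var/app/a.jsp` and then `/var/app/b.jsp`, keeping the order in which each path first appeared.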


Comments

Posted By: karthiksomu on December 22, 2009

Hi, my input file looks like this:

Hai How are you
Hai How are you
Hai How are you
Hi
Hai How are you
Hi

But after running the above script I got the output below. What might be the problem?

Hai How are you
Hi
Hi

It should have removed the last 'Hi' too, but it did not. Why?

Any help would be greatly appreciated.

Posted By: noah on February 10, 2010

@karthiksomu, as written the script considered trailing whitespace to be significant. So "Hi " is not the same as "Hi"

I'm not sure that was the problem in your case. But I've fixed the script so that it now ignores trailing whitespace, so try it again and see if that solves your problem?

Another (better) option would be to use sort and uniq to do this job:

$ cat file_from_karthiksomu.txt | sort | uniq
Hai How are you
Hi
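
To illustrate the trailing-whitespace point from the reply above, here is a minimal sketch that counts distinct hash keys before and after trimming. The variable names and the two sample lines are assumptions chosen for illustration, not the actual contents of the file in question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample lines that differ only by a trailing space (illustrative data).
my @raw = ("Hi \n", "Hi\n");

# Without trimming, each line becomes its own hash key.
my %untrimmed;
$untrimmed{$_}++ for @raw;

# With trailing whitespace stripped, both lines collapse to the same key.
my %trimmed;
for my $line (@raw) {
    (my $key = $line) =~ s/\s+$//;
    $trimmed{$key}++;
}

printf "distinct keys without trimming: %d\n", scalar(keys %untrimmed);
printf "distinct keys with trimming:    %d\n", scalar(keys %trimmed);
```

Without trimming the sketch counts two distinct keys, which is why both "Hi" lines would survive; with trimming it counts one.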

Posted By: MariaFleetwood on September 20, 2010

Nice and easy piece of code, thank you. It saves my time.
