Perl - Find duplicate lines in file or array

Question

I'm trying to print the duplicate lines from a filehandle, not remove them or do any of the other things I see asked in other questions. I don't have enough experience with Perl to do this quickly, so I'm asking here. How can I do this?

A lot depends on the size of the input, the sizes of the lines and the potential number of duplicates. If the memory requirements are low, then the solutions with a %duplicates hash are adequate.
– Sinan Ünür, Commented May 4, 2011 at 13:57
They are. I'm just using the <DATA> filehandle to quickly check something. It doesn't look like there are any duplicates, so that's good.
– Chris, Commented May 4, 2011 at 14:00

G. Cito · Accepted Answer · 2013-07-15 14:36:27Z

25

Using the standard Perl shorthands:

my %seen;
while ( <> ) { 
    print if $seen{$_}++;
}

As a "one-liner":

perl -ne 'print if $seen{$_}++'

More data? This prints <file name>:<line number>:<line>:

perl -ne 'print( ( $ARGV eq "-" ? "" : "$ARGV:" ), "$.:$_" ) if $seen{$_}++'

Explanation of %seen:

my %seen declares a hash. For each unique line in the input (which comes from while (<>) in this case), $seen{$_} is a scalar slot in the hash, keyed by the text of the line (this is what $_ is doing inside the hash {} braces).
Using the postfix increment operator (x++), we take the current value for our expression and increment it only afterwards. So, if we haven't "seen" the line, $seen{$_} is undefined, but when forced into a numeric context like this it is taken as 0, which is false.
Then it's incremented to 1.

So, when the while begins to run, all lines are "zero" (if it helps, you can think of the lines as "not %seen"). The first time we see a line, the post-increment yields 0, which fails the if, and the count at the scalar slot becomes 1. Thus, it is at least 1 for any future occurrence, at which point it passes the if condition and the line is printed.
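The sequence of values the if sees can be made visible with a small sketch. The helper name trace_seen and the sample lines are made up for illustration; the only thing it relies on is that a post-increment of an undefined value returns 0.

```perl
use strict;
use warnings;

# Hypothetical helper: for each input line, return the value that
# $seen{$line}++ yields, i.e. the count *before* the increment.
sub trace_seen {
    my @lines = @_;
    my %seen;
    return map { $seen{$_}++ } @lines;
}

my @values = trace_seen( "a\n", "b\n", "a\n", "a\n" );
print "@values\n";    # prints: 0 0 1 2
```

Every 0 is a first sighting (the if fails); every non-zero value is a repeat (the if passes and the line is printed).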

Now, as I said above, my %seen declares the hash, but with strict turned off (as in the one-liner), a variable can be created on the spot. So the first time perl sees $seen{$_}, it knows that I'm looking for %seen; it doesn't have it, so it creates it.

An added neat thing about this is that at the end, if you care to use it, you have a count of how many times each line was repeated.
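As a rough sketch of using those counts afterwards: the helper name count_lines and the sample text below are assumptions for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each line occurs on a filehandle.
sub count_lines {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        $seen{$line}++;
    }
    return \%seen;
}

# Sample input, read through an in-memory filehandle.
my $text = "apple\nbanana\napple\ncherry\napple\nbanana\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my $count = count_lines($fh);
close $fh;

# Report only the lines that occurred more than once.
for my $line ( sort keys %$count ) {
    print "$line: $count->{$line}\n" if $count->{$line} > 1;
}
# prints:
# apple: 3
# banana: 2
```

Note that here the hash holds the total number of occurrences, whereas inside the loop of the one-liner the tested value is the count before the current line.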

edited Jul 15, 2013 at 14:36

G. Cito


answered May 4, 2011 at 13:50

Axeman



4 Comments

Chris Over a year ago

Can you explain how $seen{$_}++ works exactly? I get that it's assigning the current line's value to a hash table, but what is the ++ doing here that makes it find duplicates?

TLP Over a year ago

$seen{$_} refers to a value in the hash %seen, with the key $_, which is the current line. The ++ operator will increment the hash value. This means, the first time a key appears, its value will be false, and the print will not happen. The subsequent times it is seen, it will be >0, and so the print will execute, and print without args by default prints the $_ variable.

Chris Over a year ago

Ah, so the key for the hash is the line, but the value is the number of times it was found in the file -1.

pedorro Over a year ago

perl nerds impress the hell out of me. +2 if I could!

makes · Accepted Answer · 2011-11-03 17:19:42Z

3

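Prints dupes only once (shown with single quotes for Unix shells, where double quotes would let the shell expand $seen and $_; in Windows cmd.exe use double quotes instead):

perl -ne 'print if $seen{$_}++ == 1'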
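Since the question title also mentions arrays, the same test can be applied with grep. This is a minimal sketch; the helper name dupes_once and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return each duplicated element exactly once,
# at the position of its second occurrence.
sub dupes_once {
    my %seen;
    return grep { $seen{$_}++ == 1 } @_;
}

my @lines = qw(red green red blue green red);
my @dupes = dupes_once(@lines);
print "@dupes\n";    # prints: red green
```

The comparison == 1 is true only on the second sighting, so a third or later copy is not reported again.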

edited Nov 3, 2011 at 17:19

makes


answered Nov 2, 2011 at 20:08

Alex B


2 Comments

G. Cito Over a year ago

This is like sort file.txt | uniq -d (print only duplicates) in a typical Unix shell. Is there a simple equivalent of sort file.txt | uniq -u (print only unique lines)?

jubilatious1 Feb 27 at 14:47

@G.Cito perl -ne 'print unless $a{$_}++'. See: catonmat.net/perl-one-liners-explained-part-six
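One caveat, offered as a sketch rather than a correction of the thread: print unless $a{$_}++ prints the first occurrence of every line, while uniq -u prints only the lines that are never repeated. Getting the latter seems to need the full counts before anything is printed, for example in two passes. The helper name unique_only and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the elements that occur exactly once,
# in their original order. Pass 1 counts, pass 2 filters.
sub unique_only {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return grep { $count{$_} == 1 } @lines;
}

my @only_once = unique_only(qw(red green red blue green red yellow));
print "@only_once\n";    # prints: blue yellow
```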

mcgrailm · Accepted Answer · 2011-05-04 13:50:32Z

2

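Try this:

#!/usr/bin/perl -w
use strict;
use warnings;

my %duplicates;
while (<DATA>) {
    # Note: as written, this prints the first occurrence of each line.
    # To print the repeats instead, test `if defined $duplicates{$_}`.
    print if !defined $duplicates{$_};
    $duplicates{$_}++;
}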

answered May 4, 2011 at 13:50

mcgrailm


1 Comment

Blrfl Over a year ago

I'd do print unless exists $duplicates{$_}. And +1 for -w, use strict and use warnings.

Svante · Accepted Answer · 2011-05-04 16:07:37Z

2

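If you have a Unix-like system, you can use uniq. It compares adjacent lines only, so sort the input first if the repeats may be scattered through the file: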

uniq -d foo

or

uniq -D foo

should do what you want. More information: man uniq.
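For comparison, printing every copy of a repeated line (roughly what uniq -D does on sorted input) can be sketched in Perl without sorting. The helper name all_repeats and the sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every copy of each element that occurs
# more than once, keeping the original order. Pass 1 counts, pass 2 filters.
sub all_repeats {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return grep { $count{$_} > 1 } @lines;
}

my @copies = all_repeats(qw(red green red blue green red));
print "@copies\n";    # prints: red green red green red
```

Unlike uniq, this does not depend on the repeats being adjacent, at the cost of holding the whole input in memory.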

answered May 4, 2011 at 16:07

Svante


1 Comment

jubilatious1 Feb 27 at 14:17

Interesting that uniq -d or uniq -D don't require sort to return the answer desired by the OP (print duplicated lines).
