8

I'm trying to print duplicate lines from the filehandle, not remove them or anything else I see asked on other questions. I don't have enough experience with perl to be able to quickly do this, so I'm asking here. What's the way to do this?

  • A lot depends on the size of input, sizes of lines and the potential number of duplicates. If the memory requirements are low, then the solutions with a %duplicates hash are adequate. Commented May 4, 2011 at 13:57
  • They are. I'm just using the <DATA> filehandle to quickly check something. It doesn't look like there are any duplicates, so that's good. Commented May 4, 2011 at 14:00

4 Answers

25

Using the standard Perl shorthands:

my %seen;
while ( <> ) { 
    print if $seen{$_}++;
}

As a "one-liner":

perl -ne 'print if $seen{$_}++'

More data? This prints <file name>:<line number>:<line>:

perl -ne 'print( ( $ARGV eq "-" ? "" : "$ARGV:" ), "$.:$_" ) if $seen{$_}++'

Explanation of %seen:

  • my %seen declares a hash. For each unique line in the input (which comes from while (<>) in this case), $seen{$_} is a scalar slot in the hash, keyed by the text of the line (this is what $_ is doing inside the hash {} braces).
  • The postfix increment operator (x++) yields the current value for our expression and increments it afterwards. So if we haven't "seen" the line, $seen{$_} is undefined, but when forced into a numeric context like this it is taken as 0, which is false.
  • Then it's incremented to 1.

So, when the while begins to run, all lines are "zero" (if it helps, you can think of the lines as "not %seen"). The first time we see a line, perl takes the undefined value, which fails the if, and increments the count at the scalar slot to 1. The count is therefore 1 or more on any future occurrence, at which point it passes the if condition and the line is printed.
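To make the walk-through above concrete, here is a minimal sketch that records what $seen{$line}++ yields for each line of a small, made-up input. The sub name trace_seen and the sample lines are illustrative assumptions, not part of the answer.

```perl
use strict;
use warnings;

# Return, for each input line, the value that $seen{$line}++ yields,
# i.e. the count *before* the increment takes effect.
sub trace_seen {
    my @lines = @_;
    my %seen;
    my @values;
    for my $line (@lines) {
        # Postfix ++ returns the old value, then stores old value + 1.
        # A slot that does not exist yet is treated as 0, with no warning.
        push @values, $seen{$line}++;
    }
    return @values;
}

# Hypothetical sample input: "a" occurs three times, "b" once.
my @input = ("a\n", "b\n", "a\n", "a\n");
my @old   = trace_seen(@input);

for my $i (0 .. $#input) {
    my $line = $input[$i];
    chomp $line;
    printf "%s: old value %d, %s\n",
        $line, $old[$i], $old[$i] ? "printed" : "skipped";
}
```

Only the second and third "a" have a true old value, so only they would pass the if in the original loop.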

As I said above, my %seen declares the hash, but with strict turned off (as in the one-liner) a variable can be created on the spot. So the first time perl sees $seen{$_}, it knows that I'm looking for %seen, finds that it doesn't exist yet, and creates it.

An added neat thing about this is that at the end, if you care to use it, you have a count of how many times each line was repeated.
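As a rough sketch of using those counts afterwards, the example below tallies lines from a filehandle and then reports every line that occurred more than once. The sub name count_lines, the in-memory sample text and the tab-separated output format are assumptions made for illustration. Unlike the loop above, it chomps each line first, so the hash keys carry no trailing newline.

```perl
use strict;
use warnings;

# Count occurrences of each line read from a filehandle.
# Returns a hash reference mapping line text => number of occurrences.
sub count_lines {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;       # drop the newline so keys are the bare text
        $seen{$line}++;
    }
    return \%seen;
}

# Hypothetical sample input, read through an in-memory filehandle.
my $text = "apple\nbanana\napple\ncherry\napple\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $count = count_lines($fh);
close $fh;

# Report only the lines that were seen more than once.
for my $line (sort keys %$count) {
    print "$count->{$line}\t$line\n" if $count->{$line} > 1;
}
```

The same sub should work on STDIN or DATA by passing \*STDIN or \*DATA instead of $fh.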


4 Comments

Can you explain how $seen{$_}++ works exactly? I get that it's assigning the current line's value to a hash table, but what is the ++ doing here that makes it find duplicates?
$seen{$_} refers to a value in the hash %seen, with the key $_, which is the current line. The ++ operator will increment the hash value. This means, the first time a key appears, its value will be false, and the print will not happen. The subsequent times it is seen, it will be >0, and so the print will execute, and print without args by default prints the $_ variable.
Ah, so the key for the hash is the line, but the value is the number of times it was found in the file -1.
perl nerds impress the hell out of me. +2 if I could!
3

Prints dupes only once:

perl -ne 'print if $seen{$_}++ == 1'

2 Comments

This is like sort file.txt | uniq -d (print only duplicates) in a typical Unix shell. Is there a simple equivalent of sort file.txt | uniq -u (print only unique lines)?
@G.Cito perl -ne 'print unless $a{$_}++'. See: catonmat.net/perl-one-liners-explained-part-six
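One caveat on the one-liner in the comment above: print unless $a{$_}++ prints the first occurrence of every distinct line, which is not quite what uniq -u does (uniq -u drops any line that repeats). A minimal two-pass sketch for printing only the lines that occur exactly once might look like the following. The sub name unique_lines and the sample input are assumptions, and the approach holds the whole input in memory.

```perl
use strict;
use warnings;

# Return the lines that occur exactly once, in their original order.
sub unique_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;                  # first pass: count every line
    return grep { $count{$_} == 1 } @lines;   # second pass: keep singletons
}

# Hypothetical sample input: "a" repeats, "b" and "c" do not.
print unique_lines("a\n", "b\n", "a\n", "c\n");
```

Two passes are needed because a line cannot be known to be unique until the whole input has been read.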
2

try this

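#!/usr/bin/perl -w
use strict;
use warnings;

my %duplicates;
while (<DATA>) {
    # As written, this prints each line the first time it is seen.
    # To print the repeats instead, use: print if defined $duplicates{$_};
    print if !defined $duplicates{$_};
    $duplicates{$_}++;
}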

1 Comment

I'd do print unless exists $duplicates{$_}. And +1 for -w, use strict and use warnings.
2

If you have a Unix-like system, you can use uniq:

uniq -d foo

or

uniq -D foo

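should do what you want, as long as the duplicate lines are adjacent: uniq only compares neighbouring lines, so sort the input first if they are not. Note that -D is a GNU extension. More information: man uniq.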

1 Comment

Interesting that uniq -d or uniq -D don't require sort to return the answer desired by the OP (print duplicated lines).
