
How to remove duplicate lines?

My current code:

use strict;
use warnings;
my $input = 'input.txt';
my $output = 'output.txt';
my %seen;

open("OP",">$output") or die;
open("IP","<$input") or die;

while(my $string = <IP>) {
    my @arr1 = join("",$string);
    my @arr2 = grep { !$seen{$_}++ } @arr1;
    print "@arr2\n";
    print OP "@arr2\n";
}

close("IP");
close("OP");

Input:

india
australia
america
singapore
india
america

Expected output :

india
australia
america
singapore
  • I don't know what you think that join statement will do, but it is in fact not doing anything. The loop is redundant, you can just grep the whole file and print it right away. E.g. perl -e'print grep !$seen{$_}++, <>;' input > output Commented Jun 22, 2021 at 17:16
  • Recommended use of open, Modern Perl Programming. Commented Jun 22, 2021 at 17:26
  • @Noor Perl's join is used to concatenate string pieces into one string Commented Jun 22, 2021 at 18:04

4 Answers


Use this Perl one-liner to delete all duplicates, whether adjacent or not:

perl -ne 'print unless $seen{$_}++;' input.txt > output.txt

To delete only adjacent duplicates (as in UNIX uniq command):

perl -ne 'print unless $_ eq $prev; $prev = $_; ' input.txt > output.txt

The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.

When the line is seen for the first time, $seen{$_} is evaluated first, and is false, so the line is printed. Then, $seen{$_} is incremented by one, which makes it true every time the line is seen again (thus the same line is not printed any more).

The first one-liner avoids reading the entire file into memory all at once, which could be important for inputs with lots of long duplicated lines. Only the first occurrence of every line is stored in memory, together with its number of occurrences.


4 Comments

Is there a way to make it more like the uniq command in the sense that only adjacent duplicate lines are collapsed into a single line?
Beautiful! I hope without sounding too demanding, I could ask for the moonshot: For the latter case, it'd be fantastic to print a star for each suppressed duplicate line. So, if a line appears 4 times in a row, it'll be shown once with 3 stars in front of it :)
@Unknown I suggest to ask this as a separate question. You will get more and better answers. Good luck!
I'll just code it in C. Perl is not made for mere mortals :)

Please look at the following code snippet; you were very close to making proper use of the %seen hash.

use strict;
use warnings;
use feature 'say';

my %seen;
my @uniq;

while( <DATA> ) {
    chomp;
    push @uniq, $_ unless $seen{$_};
    $seen{$_} = 1;
}

say for @uniq;

__DATA__
india
australia
america
singapore
india
america

Output

india
australia
america
singapore



I removed the unwanted lines of code from your script.

Here is the updated script:

use strict; use warnings;
use Data::Dumper;

my %seen;

my @lines = <DATA>;
chomp @lines;

my @countries = grep { !$seen{$_}++ } @lines;
print Dumper(\@countries);

__DATA__
india
australia
america
singapore
india
america

Result:

$VAR1 = [
          'india',
          'australia',
          'america',
          'singapore'
        ];

Comments


You are making this all far too complicated. The main section of your code can be simplified to:

while (<IP>) {
  print unless $seen{$_}++;
}

Or even:

print grep { ! $seen{$_}++ } <IP>;
