
How to remove duplicate lines?

My current code:

use strict;
use warnings;
my $input = 'input.txt';
my $output = 'output.txt';
my %seen;

open("OP",">$output") or die;
open("IP","<$input") or die;

while(my $string = <IP>) {
    my @arr1 = join("",$string);
    my @arr2 = grep { !$seen{$_}++ } @arr1;
    print "@arr2\n";
    print OP "@arr2\n";
}

close("IP");
close("OP");

Input:

india
australia
america
singapore
india
america

Expected output :

india
australia
america
singapore
  • I don't know what you think that join statement will do, but it is in fact not doing anything. The loop is redundant, you can just grep the whole file and print it right away. E.g. perl -e'print grep !$seen{$_}++, <>;' input > output Commented Jun 22, 2021 at 17:16
  • Recommended use of open, Modern Perl Programming. Commented Jun 22, 2021 at 17:26
  • @Noor Perl's join is used to concatenate string pieces into one string Commented Jun 22, 2021 at 18:04

4 Answers


Use this Perl one-liner to delete all duplicates, whether adjacent or not:

perl -ne 'print unless $seen{$_}++;' input.txt > output.txt

To delete only adjacent duplicates (as in UNIX uniq command):

perl -ne 'print unless $_ eq $prev; $prev = $_; ' input.txt > output.txt

The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.

When the line is seen for the first time, $seen{$_} is evaluated first, and is false, so the line is printed. Then, $seen{$_} is incremented by one, which makes it true every time the line is seen again (thus the same line is not printed any more).

The first one-liner avoids reading the entire file into memory all at once, which could be important for inputs with lots of long duplicated lines. Only the first occurrence of every line is stored in memory, together with its number of occurrences.


4 Comments

Is there a way to make it more like the uniq command in the sense that only adjacent duplicate lines are collapsed into a single line?
Beautiful! I hope without sounding too demanding, I could ask for the moonshot: For the latter case, it'd be fantastic to print a star for each suppressed duplicate line. So, if a line appears 4 times in a row, it'll be shown once with 3 stars in front of it :)
@Unknown I suggest to ask this as a separate question. You will get more and better answers. Good luck!
I'll just code it in C. Perl is not made for mere mortals :)

Please look at the following code snippet; you were very close to making proper use of the %seen hash.

use strict;
use warnings;
use feature 'say';

my %seen;
my @uniq;

while( <DATA> ) {
    chomp;
    push @uniq, $_ unless $seen{$_};
    $seen{$_} = 1;
}

say for @uniq;

__DATA__
india
australia
america
singapore
india
america

Output

india
australia
america
singapore



I removed the unwanted lines of code from your script.

Here is the updated script:

use strict; use warnings;
use Data::Dumper;

my %seen;

my @lines = <DATA>;
chomp @lines;

my @countries = grep { !$seen{$_}++ } @lines;
print Dumper(\@countries);

__DATA__
india
australia
america
singapore
india
america

Result:

$VAR1 = [
          'india',
          'australia',
          'america',
          'singapore'
        ];

Comments


You are making this all far too complicated. The main section of your code can be simplified to:

while (<IP>) {
  print unless $seen{$_}++;
}

Or even:

print grep { ! $seen{$_}++ } <IP>;
