Delete duplicates when the duplicates are not in the same column and not same order in Unix

Question

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

Now, the problem is that line 4

AB006589__ESR2,NM_003476__CSRP3,0.45767

is a duplicate of line 8

NM_003476__CSRP3,AB006589__ESR2,0.45767

There are many cases like this in my large CSV file.

So my question is how to identify all such duplicates and delete one record from each pair.

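use strict;
use warnings;

my %hash;

# Lexical filehandle, three-argument open, and an error check
open( my $tf, '<', 'tf_tf_mic.csv' ) or die "Cannot open tf_tf_mic.csv: $!";

while ( <$tf> ) {
    chomp;
    my @words = split /,/, $_;
    # Keep a record only if neither x,y nor y,x has been stored already
    unless ( exists $hash{"$words[0]\t$words[1]"} || exists $hash{"$words[1]\t$words[0]"} ) {
        $hash{"$words[0]\t$words[1]"} = $_;
    }
}

close $tf;

# keys %hash returns the records in no particular order
foreach ( keys %hash ) {
    print "$hash{$_}\n";
}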

This actually worked in 10 seconds for a 4 million line file.

What have you tried? What exactly is a "duplicate"? Why is this tagged perl?
– melpomene, Commented Aug 20, 2016 at 23:29
I tried to do this in Perl and I felt like there should be a better way using Unix. A duplicate is x,y = y,x. So my 2 and 3 columns can interchange, but it is the same information.
– ChathuraG, Commented Aug 20, 2016 at 23:59
Your latest update says "it worked" while there is residual commentary "This is my Perl code but it does not work". Since they contradict each other, at most one of them is correct. It is also generally not a good idea to post broken code and then update it so it is working code. At minimum, keep the broken code around and add the working code, or consider adding a self-answer (an answer by the person asking the question) with the fixed code and an explanation of what you fixed. You can accept that a couple of days later if it is still the best answer.
– Jonathan Leffler, Commented Aug 21, 2016 at 2:55

Borodin · Accepted Answer · 2016-08-21 08:37:44Z

There is no need for such complication. If you sort the fields in a record so that any given pair of values is always in the same order, then you can simply print a record if its contents haven't been seen before:

use strict;
use warnings 'all';

my %seen;

while ( <DATA> ) {
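    my @fields = split /,/;
    # Sort only the two name fields so that x,y and y,x give the same key;
    # the score in the third field must not take part in the key
    my @pair = sort @fields[0,1];
    print unless $seen{"@pair"}++;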
}


__DATA__
AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

output

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161

for_stack · Accepted Answer · 2016-08-21 02:42:26Z

You can reorder the fields of each line before putting it into the hash:

Split each line on , into fields and drop the trailing score: my @fields = split /,/; pop @fields;
Sort the remaining fields: @fields = sort @fields;
Join the sorted fields into a new string: my $str = join "\t", @fields;
Store the line only if the new string is not already in the hash: $hash{$str} = $_ unless exists $hash{$str};
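
Put together, the steps above might look like the following minimal sketch. The sub name dedupe_pairs and the @kept array are assumptions made for illustration, not part of the original answer; @kept is there only so the surviving records come back in input order, which iterating over the hash would not guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input records with reversed pairs removed.
# Assumes every record has the form name1,name2,score.
sub dedupe_pairs {
    my @lines = @_;
    my ( %hash, @kept );

    for my $line (@lines) {
        my @fields = split /,/, $line;
        pop @fields;                       # drop the score
        @fields = sort @fields;            # x,y and y,x now look the same
        my $str = join "\t", @fields;      # order-independent key

        next if exists $hash{$str};        # this pair was already stored
        $hash{$str} = $line;
        push @kept, $line;                 # remember input order
    }

    return @kept;
}

print "$_\n" for dedupe_pairs(
    'AB006589__ESR2,NM_003476__CSRP3,0.45767',
    'NM_003476__CSRP3,AB006589__ESR2,0.45767',
    'AB006589__ESR2,BC024181__ESR2,0.47796',
);
```

With these three records the first and third are printed, and the second is skipped as the reverse of the first.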
