AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

Now, the problem is that line 4

AB006589__ESR2,NM_003476__CSRP3,0.45767

is a duplicate of line 8

NM_003476__CSRP3,AB006589__ESR2,0.45767

There are many cases like this in my large CSV file.

So my question is how to identify all such duplicates and delete one line from each pair.

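use strict;
use warnings;

my %hash = ();

# Lexical filehandle, three-argument open, and an error check
open( my $tf, '<', 'tf_tf_mic.csv' ) or die "Cannot open tf_tf_mic.csv: $!";

while ( <$tf> ) {
    chomp;
    my @words = split ",", $_;
    # Skip the line if this pair was already stored in either order
    next if exists $hash{"$words[0]\t$words[1]"} || exists $hash{"$words[1]\t$words[0]"};
    $hash{"$words[0]\t$words[1]"} = $_;
}

close $tf;

# Hash keys come back in no particular order, so the output order differs from the input
foreach ( keys %hash ) {
    print "$hash{$_}\n";
}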

This actually worked in 10 seconds on a 4-million-line file.

  • What have you tried? What exactly is a "duplicate"? Why is this tagged perl? Commented Aug 20, 2016 at 23:29
  • I tried to do this in Perl, but I felt there should be a better way using Unix tools. A duplicate is x,y = y,x: the first two columns can be swapped, but the line carries the same information. Commented Aug 20, 2016 at 23:59
  • What do you mean by "using unix"? Commented Aug 21, 2016 at 0:45
  • Your latest update says "it worked" while there is residual commentary "This is my Perl code but it does not work". Since they contradict each other, at most one of them is correct. It is also generally not a good idea to post broken code and then update it so it is working code. At minimum, keep the broken code around and add the working code, or consider adding a self-answer (an answer by the person asking the question) with the fixed code and an explanation of what you fixed. You can accept that a couple of days later if it is still the best answer. Commented Aug 21, 2016 at 2:55

2 Answers

There is no need for such complication. If you sort the fields in a record so that any given pair of values is always in the same order, then you can simply print a record if its contents haven't been seen before:

use strict;
use warnings 'all';

my %seen;

while ( <DATA> ) {
    # Sort only the two names; sorting all three fields would pull the score into the key
    my @pair = ( split /,/ )[ 0, 1 ];
    print unless $seen{ join ',', sort @pair }++;
}


__DATA__
AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

output

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161

You can reorder the fields of each line before putting it into the hash:

  1. Split each line on commas and drop the last field (the score): my @fields = split /,/; pop @fields;
  2. Sort the remaining fields: @fields = sort @fields;
  3. Join the sorted fields into a new string: my $str = join "\t", @fields;
  4. Store the line only if the new string is not already in the hash: $hash{$str} = $_ unless exists $hash{$str};
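
The four steps above can be put together roughly as follows. This is only a sketch: the subroutine name dedupe_pairs, the @kept array (added so the output keeps the input order, since hash keys are unordered), and the sample lines are assumptions for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the given lines with swapped-pair
# duplicates removed, keeping the first occurrence of each pair.
sub dedupe_pairs {
    my @lines = @_;
    my ( %hash, @kept );
    for my $line (@lines) {
        my @fields = split /,/, $line;
        pop @fields;                     # step 1: drop the score
        @fields = sort @fields;          # step 2: put the names in a fixed order
        my $str = join "\t", @fields;    # step 3: build the lookup key
        next if exists $hash{$str};      # step 4: skip pairs already seen
        $hash{$str} = $line;
        push @kept, $line;               # assumption: preserve input order
    }
    return @kept;
}

# Sample lines taken from the question's data
my @sample = (
    "AB006589__ESR2,NM_003476__CSRP3,0.45767",
    "AB006589__ESR2,NM_000278__PAX2,0.4161",
    "NM_003476__CSRP3,AB006589__ESR2,0.45767",
);
print "$_\n" for dedupe_pairs(@sample);
```

For a multi-million-line file it would probably be better to read and print line by line inside the loop rather than collect everything in an array first; the list-based form is used here only to keep the example short.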
