0

I have two files, each approximately 700,000 lines, that look like:

File1

1 rs58108140 0 10583 A G
1 rs189107123 0 10611 G C
1 rs180734498 0 13302 T C
1 rs144762171 0 13327 C G
22 rs1001008 0 44928369 G A

File2

hg19chrc snpid a1 a2 bp info or se p ngt
1 rs4951859 C G 729679 0.631 0.97853 0.0173 0.2083 0
1 rs142557973 T C 731718 0.665 1.01949 0.0198 0.3298 0
22 rs1001008 A G 44928369 0.969 0.98649 0.0107 0.2023 0

I have a script where I'm finding lines where fields 1 and 4 from File 1 and fields 1 and 5 from File 2 match. Then if fields 5 and 6 from File 1 match fields 3 and 4 from File 2, I'm simply printing out the line from File 2. However, if fields 3 and 4 (File2) are flipped with respect to fields 5 and 6 (File1), I'm taking the reciprocal of field 7 from File2, and printing the line with that adjusted field 7:

#! perl -w                                                                                                           
use strict;
use warnings;

my @kgloci;
open( my $loci_in, "<", "File1" ) or die $!;
while (<$loci_in>) {
    my ($chr, $snpID, $dist, $bp, $A1, $A2) = split;
    next if m/Chromosome/;
    push @kgloci, [$chr, $snpID, $dist, $bp, $A1, $A2];
}
close $loci_in;

my $filename = shift @ARGV;
open( my $input, "<", "File2" ) or die $!;
while (<$input>) {
    next if m/hg19chrc/;
    my ($chr, $snpID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt) = split;
    foreach my $kglocus (@kgloci) {
        if (    $chr == $kglocus->[0]
                and $bp == $kglocus->[3]
                and $A1 eq $kglocus->[4] ){
        print "$chr $snpID $A1 $A2 $bp $info $or $se $p $ngt\n";
            next;
        }
        elsif ( $chr == $kglocus->[0]
            and $bp == $kglocus->[3]
            and $A1 eq $kglocus->[5]){
        my $new_or = 1/$or;
        print "$chr $snpID $A1 $A2 $bp $info $new_or $se $p $ngt\n";  
        next;
        }
    }
}
close($input);

As is, the script will run for days. Can someone point out a way to increase efficiency?

1
  • Import the files into a database? Commented Jul 22, 2015 at 22:56

1 Answer 1

4

Use a hash instead of an array.

The following script should give the same output as yours (unless A1 and A2 are the same in File1):

#! /usr/bin/perl
use warnings;
use strict;

my %lookup;
open my $LOCI, '<', 'File1' or die $!;
while (<$LOCI>) {
    next if /Chromosome/;
    my ($chr, $snpID, $dist, $bp, $A1, $A2) = split;
    $lookup{"$chr:$bp:$A1"} = 1;
    $lookup{"$chr:$bp:$A2"} = 2;
}

open my $IN, '<', 'File2' or die $!;
while (<$IN>) {
    next if m/hg19chrc/;
    my ($chr, $snpID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt) = split;
    my $new_or = ( sub {},
                   sub { shift },
                   sub { 1 / shift },
                 )[ $lookup{"$chr:$bp:$A1"} || 0 ]->($or);
    print "$chr $snpID $A1 $A2 $bp $info $new_or $se $p $ngt\n"
        if defined $new_or;
}

This is how I created testing data:

perl -MList::Util=shuffle -wE '
    $i = 1;
    $pos = 1;
    for (1 .. 10000) {
        say join " ", $i, "rs$pos", int rand 5, 1000 + int rand 10000,
            (shuffle(qw(A C T G)))[0,1];
        ++$i if rand 1 > .99;
        $pos += int rand 20;
    }' > File1

perl -wE '
    $i = 1;
    $pos = 1;
    for (1 .. 10000) {
        say join " ", $i, "rs$pos", (map qw(A C T G)[rand 4],1,2),
            1000 + int rand 10000, rand 1, rand 5, rand 1, rand 1,
            int rand 3;
        ++$i if rand 1 > .99;
        $pos += int rand 20;
    }' > File2

Results: the old script takes 16 seconds, the new one less than 0.1s. For larger files (700_000 lines), the new script takes 4s.

Sign up to request clarification or add additional context in comments.

6 Comments

This script is very fast, thank you. I did notice however that your script is outputting incorrect value for field 7 in some cases. For example, where File1 is: 22 rs1296744 0 17968155 A G and File2 is 22 rs1296744 A G 17968155 0.956 0.9977 0.0108 0.831 0 the output is 22 rs1296744 A G 17968155 0.956 1.00230530219505 0.0108 0.831 0 where field 7 should actually remain 0.9977. Strangely my script is outputting two lines for this one row, one with the correct field 7 and one with the adjusted field 7
@theo4786: There's probably another line influencing the output. With only these two lines, I'm getting the same results from both the scripts.
Yes you're right, for these instances there are two input lines containing duplicate fields 1 and 4 in File1, but different values in the other fields.
@theo4786: Can you show such lines and indicate correct output?
For example, these lines in File1: 22 rs1296744 0 17968155 A G and 22 rs201055523 0 17968155 AGG A. Correct output is 22 rs1296744 A G 17968155 0.956 0.9977 0.0108 0.831 0. I believe I can achieve this by $lookup{"$chr:$snpID:$bp:$A1"}, as I want output where $snpID also matches.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.