need to optimize perl script

Question

I have two files, each approximately 700,000 lines, that look like:

File1

1 rs58108140 0 10583 A G
1 rs189107123 0 10611 G C
1 rs180734498 0 13302 T C
1 rs144762171 0 13327 C G
22 rs1001008 0 44928369 G A

File2

hg19chrc snpid a1 a2 bp info or se p ngt
1 rs4951859 C G 729679 0.631 0.97853 0.0173 0.2083 0
1 rs142557973 T C 731718 0.665 1.01949 0.0198 0.3298 0
22 rs1001008 A G 44928369 0.969 0.98649 0.0107 0.2023 0

I have a script where I'm finding lines where fields 1 and 4 from File 1 and fields 1 and 5 from File 2 match. Then if fields 5 and 6 from File 1 match fields 3 and 4 from File 2, I'm simply printing out the line from File 2. However, if fields 3 and 4 (File2) are flipped with respect to fields 5 and 6 (File1), I'm taking the reciprocal of field 7 from File2, and printing the line with that adjusted field 7:

#! perl -w                                                                                                           
use strict;
use warnings;

my @kgloci;
open( my $loci_in, "<", "File1" ) or die $!;
while (<$loci_in>) {
    my ($chr, $snpID, $dist, $bp, $A1, $A2) = split;
    next if m/Chromosome/;
    push @kgloci, [$chr, $snpID, $dist, $bp, $A1, $A2];
}
close $loci_in;

my $filename = shift @ARGV;
open( my $input, "<", "File2" ) or die $!;
while (<$input>) {
    next if m/hg19chrc/;
    my ($chr, $snpID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt) = split;
    foreach my $kglocus (@kgloci) {
        if (    $chr == $kglocus->[0]
                and $bp == $kglocus->[3]
                and $A1 eq $kglocus->[4] ){
        print "$chr $snpID $A1 $A2 $bp $info $or $se $p $ngt\n";
            next;
        }
        elsif ( $chr == $kglocus->[0]
            and $bp == $kglocus->[3]
            and $A1 eq $kglocus->[5]){
        my $new_or = 1/$or;
        print "$chr $snpID $A1 $A2 $bp $info $new_or $se $p $ngt\n";  
        next;
        }
    }
}
close($input);

As is, the script will run for days. Can someone point out a way to increase efficiency?

Import the files into a database?

TLP
– TLP

2015-07-22 22:56:05 +00:00
Commented Jul 22, 2015 at 22:56 — TLP
– TLP, Commented Jul 22, 2015 at 22:56

choroba · Accepted Answer · 2015-07-22 22:25:32Z

4

Use a hash instead of an array.

The following script should give the same output as yours (unless A1 and A2 are the same in File1):

#! /usr/bin/perl
use warnings;
use strict;

my %lookup;
open my $LOCI, '<', 'File1' or die $!;
while (<$LOCI>) {
    next if /Chromosome/;
    my ($chr, $snpID, $dist, $bp, $A1, $A2) = split;
    $lookup{"$chr:$bp:$A1"} = 1;
    $lookup{"$chr:$bp:$A2"} = 2;
}

open my $IN, '<', 'File2' or die $!;
while (<$IN>) {
    next if m/hg19chrc/;
    my ($chr, $snpID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt) = split;
    my $new_or = ( sub {},
                   sub { shift },
                   sub { 1 / shift },
                 )[ $lookup{"$chr:$bp:$A1"} || 0 ]->($or);
    print "$chr $snpID $A1 $A2 $bp $info $new_or $se $p $ngt\n"
        if defined $new_or;
}

This is how I created testing data:

perl -MList::Util=shuffle -wE '
    $i = 1;
    $pos = 1;
    for (1 .. 10000) {
        say join " ", $i, "rs$pos", int rand 5, 1000 + int rand 10000,
            (shuffle(qw(A C T G)))[0,1];
        ++$i if rand 1 > .99;
        $pos += int rand 20;
    }' > File1

perl -wE '
    $i = 1;
    $pos = 1;
    for (1 .. 10000) {
        say join " ", $i, "rs$pos", (map qw(A C T G)[rand 4],1,2),
            1000 + int rand 10000, rand 1, rand 5, rand 1, rand 1,
            int rand 3;
        ++$i if rand 1 > .99;
        $pos += int rand 20;
    }' > File2

Results: the old script takes 16 seconds, the new one less than 0.1s. For larger files (700_000 lines), the new script takes 4s.

edited Jul 22, 2015 at 22:25

answered Jul 22, 2015 at 21:59

choroba

245k27 gold badges221 silver badges304 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

theo4786 Over a year ago

This script is very fast, thank you. I did notice however that your script is outputting incorrect value for field 7 in some cases. For example, where File1 is: 22 rs1296744 0 17968155 A G and File2 is 22 rs1296744 A G 17968155 0.956 0.9977 0.0108 0.831 0 the output is 22 rs1296744 A G 17968155 0.956 1.00230530219505 0.0108 0.831 0 where field 7 should actually remain 0.9977. Strangely my script is outputting two lines for this one row, one with the correct field 7 and one with the adjusted field 7

choroba Over a year ago

@theo4786: There's probably another line influencing the output. With only these two lines, I'm getting the same results from both the scripts.

theo4786 Over a year ago

Yes you're right, for these instances there are two input lines containing duplicate fields 1 and 4 in File1, but different values in the other fields.

choroba Over a year ago

@theo4786: Can you show such lines and indicate correct output?

theo4786 Over a year ago

For example, these lines in File1: 22 rs1296744 0 17968155 A G and 22 rs201055523 0 17968155 AGG A. Correct output is 22 rs1296744 A G 17968155 0.956 0.9977 0.0108 0.831 0. I believe I can achieve this by $lookup{"$chr:$snpID:$bp:$A1"}, as I want output where $snpID also matches.

|

Collectives™ on Stack Overflow

need to optimize perl script

1 Answer 1

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related