How to find text in data file and calculate average using perl

Question

I would like to replace a grep | awk | perl command with a pure perl solution to make it quicker and simpler to run.

I want to match each line in input.txt with a data.txt file and calculate the average of the values with the matched ID names and numbers.

The input.txt contains 1 column of ID numbers:

FBgn0260798
FBgn0040007
FBgn0046692

I would like to match each ID number with its corresponding ID names and associated value. Here's an example of data.txt, where column 1 is the ID number, columns 2 and 3 are ID name1 and ID name2, and column 4 contains the values I want to average.

FBgn0260798 CG17665 CG17665 21.4497
FBgn0040007 Gprk1   CG40129 22.4236
FBgn0046692 RpL38   CG18001 1182.88

So far I have used grep and awk to produce an output file containing the data.txt rows for the matched ID numbers, and then used that output file to calculate the count and average using the following commands:

# First part using grep | awk
exec < input.txt
while read -r line
do
    grep -w "$line" data.txt | cut -f1,2,3,4 | awk '{print $1,$2,$3,$4}' >> output.txt
done

# Second part with perl

open my $input, '<', "output_1.txt" or die; ## the output file is from the first part and has the same layout as the data.txt file

my $total = 0;
my $count = 0;

while (<$input>) {

    my ($name, $id1, $id2, $value) = split;
    $total += $value;
    $count += 1;

}

print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";

Both parts work OK, but I would like to simplify things by running just one script. I've been trying to find a quicker way of running the whole lot together in Perl, but after several hours of reading I am totally stuck on how to do it. I've been playing around with hashes, arrays, and if and elsif statements without any success. If anyone has suggestions, that would be great.

Thanks, Harriet

Please explain what it is that you are asking for. The program you show would successfully print the mean value of the fourth column of output_1.txt. Do you need any more? You seem to be asking for a pure-Perl solution to replace a grep | awk | perl command. Please explain more thoroughly, and show your complete command line.
– Borodin, Commented Mar 10, 2014 at 12:50

David W. · Accepted Answer · 2014-03-11 14:35:58Z


If I understand you, you have a data file that contains the name of each line and the value for that line. The other two IDs are not important.

A second file, the input file, lists the names to look up in the data file. The values for those names are the ones you want to average.

The fastest way is to create a hash keyed by name, where each entry holds the value for that name in the data file. Because this is a hash, you can quickly locate the corresponding value. This is much faster than grepping the same array over and over again.

This first part will read in the data.txt file and store the name and value in a hash keyed by the name.

use strict;
use warnings;
use autodie;   # open and close die with a message on failure, so no explicit checks are needed
use feature qw(say);

use constant {
    INPUT_FILE  => "input.txt",
    DATA_FILE   => "data.txt",
};

#
# Read in data.txt and get the values and keys
#
open my $data_fh, "<", DATA_FILE;
my %ids;
while ( my $line = <$data_fh> ) {
    chomp $line;
    my ($name, $id1, $id2, $value) = split /\s+/, $line;
    $ids{$name} = $value;
}
close $data_fh;

Now that you have this hash, it's easy to read through the input.txt file and locate the matching name in the data.txt file:

open my $input_fh, "<", INPUT_FILE;
my $count = 0;
my $total = 0;
while ( my $name = <$input_fh> ) {
    chomp $name;
    if ( not defined $ids{$name} ) {
         die qq(Cannot find matching id "$name" in data file\n);
    }
    $total += $ids{$name};
    $count += 1;
}
close $input_fh;
say "Average = ", $total / $count;

You read through each file once. I am assuming that you only have a single instance of each name in each file.
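
If an ID can appear more than once in data.txt, or an ID in input.txt has no matching row, the same hash approach can be adapted. The following is only a sketch under stated assumptions: `average_for_ids` is a hypothetical helper name, in-memory filehandles stand in for the two files, and warning (rather than dying) on a missing ID is a guess at the desired behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: average the values of all data rows whose first
# column appears in the ID list. Takes two open filehandles.
sub average_for_ids {
    my ($data_fh, $ids_fh) = @_;

    # Keep every value seen for an ID, so repeated IDs are not overwritten.
    my %values_for;
    while ( my $line = <$data_fh> ) {
        my ($id, $name1, $name2, $value) = split ' ', $line;
        next unless defined $value;          # skip blank or short lines
        push @{ $values_for{$id} }, $value;
    }

    my ($total, $count) = (0, 0);
    my %seen;
    while ( my $line = <$ids_fh> ) {
        my ($id) = split ' ', $line;
        next unless defined $id;             # skip blank lines
        next if $seen{$id}++;                # use each requested ID only once
        if ( !exists $values_for{$id} ) {
            warn qq(No data row for "$id"; skipping\n);   # assumption: warn, don't die
            next;
        }
        for my $value ( @{ $values_for{$id} } ) {
            $total += $value;
            $count++;
        }
    }

    # Avoid dividing by zero when nothing matched.
    return $count ? ($total, $count, $total / $count) : (0, 0, undef);
}

# Demo on the sample rows from the question, read from in-memory strings.
my $data = <<'END';
FBgn0260798 CG17665 CG17665 21.4497
FBgn0040007 Gprk1   CG40129 22.4236
FBgn0046692 RpL38   CG18001 1182.88
END
my $ids = "FBgn0260798\nFBgn0040007\nFBgn0046692\n";

open my $data_fh, '<', \$data or die "Cannot open in-memory data: $!";
open my $ids_fh,  '<', \$ids  or die "Cannot open in-memory ID list: $!";

my ($total, $count, $average) = average_for_ids($data_fh, $ids_fh);
printf "Total = %.4f, count = %d, average = %.4f\n", $total, $count, $average;
```

To run it against the real files, open data.txt and input.txt for reading and pass those filehandles in instead of the in-memory ones.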

edited Mar 11, 2014 at 14:35

answered Mar 10, 2014 at 15:28

David W.


2 Comments

user1879573 Over a year ago

Thank you for the explanation, it's really detailed and helped loads with my understanding (I've recently started learning Perl). One problem, I keep getting this error message: syntax error at match_avg_test.pl line 10, near "DATA_FILE" test.pl aborted due to compilation errors. I've commented out the strict warnings to get rid of the explicit name error messages and checked everything several times but I can't see it. If you've got any ideas, that would be great

David W. Over a year ago

Those lines should end in commas and not semicolons. Sorry.
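
For reference, here is a minimal sketch of the list form of `use constant` that the comment above refers to. The file names are the ones used in the answer; the final print line is added only for illustration.

```perl
use strict;
use warnings;

# A hash reference passed to "use constant" defines several constants at once.
# Entries inside the braces are separated by commas, not semicolons.
use constant {
    INPUT_FILE => "input.txt",
    DATA_FILE  => "data.txt",
};

# Constants are bare words, so they do not interpolate inside double quotes;
# pass them as separate list items instead.
print "Reading IDs from ", INPUT_FILE, " and values from ", DATA_FILE, "\n";
```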
