I have data in this format:

b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.6
b2  1975    0.3

I want to sum the values in column 3 for rows that share the same values in both columns 1 and 2.

I have written code that sums the values, but I do not know how to print one line per group with its total.

use strict;
use warnings;
use Data::Dumper;
my $file=shift;
open (DATA, $file);
my %score_by_year;

while ( my $line = <DATA> )
{
        my ($protein, $year, $score) = split /\s+/, $line;
        $score_by_year{$year} +=$score;
        print "$protein\t$year\t$score_by_year{$year}\n";
}
close DATA;

My code gives this output:

b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.7
b2  1975    0.3

whereas the expected output is:

b1  1995    1
b1  2007    0.1
b2  1974    0.7
b2  1975    0.3
  • Tip: Don't use global variables for file handles, especially not DATA (which already has a meaning). Use lexical variables. Don't use 2-arg open. Check the result of open, because it's a frequent source of failure: open(my $fh, '<', $qfn) or die("Can't open \"$qfn\": $!\n"); Commented May 24, 2019 at 4:36
  • Tip: split ' ', $line almost always makes more sense than split /\s+/, $line. Though if your input is tab-separated like your output, split /\t/, $line would be the appropriate solution here. Commented May 24, 2019 at 4:37
  • Heretical non-Perl approach using the ever-useful GNU datamash: datamash groupby 1,2 sum 3 < input.tsv. (If your real input isn't already sorted the way your sample is, add -s.) Commented May 24, 2019 at 8:52

1 Answer

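To keep the input order, record each (protein, year) pair the first time it appears, and sum the scores in a hash keyed by both protein and year: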

use strict;
use warnings;

my @sequence;        # (protein, year) pairs in order of first appearance
my %scores_by_year;  # $scores_by_year{$protein}{$year} holds the summed score

while (<DATA>) {
   my ($protein, $year, $score) = split;
   # Remember each pair only the first time it is seen
   if (not exists $scores_by_year{$protein}{$year}) {
     push @sequence, [$protein, $year];
   }
   $scores_by_year{$protein}{$year} += $score;
}

# Print one line per group, in input order
for my $protein_year (@sequence) {
  my ($protein, $year) = @$protein_year;
  print join("\t", $protein, $year, $scores_by_year{$protein}{$year}), "\n";
}
__DATA__
b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.6
b2  1975    0.3
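
If the data lives in a file named on the command line, as in the question, the same idea can be combined with the tips from the comments (lexical file handle, 3-arg open, checking the result of open, split ' '). This is only a minimal sketch: the sub name sum_by_group is hypothetical, and skipping lines with fewer than three fields is an assumption about how to treat blank lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): sums column 3 per
# (column 1, column 2) group and returns tab-joined lines in the order
# in which each group first appeared.
sub sum_by_group {
    my ($fh) = @_;
    my (@sequence, %scores);

    while (my $line = <$fh>) {
        my ($protein, $year, $score) = split ' ', $line;
        next unless defined $score;    # assumption: skip blank or short lines

        # Record the pair only on first sight so the input order is kept.
        push @sequence, [$protein, $year]
            unless exists $scores{$protein}{$year};
        $scores{$protein}{$year} += $score;
    }

    my @lines;
    for my $pair (@sequence) {
        my ($protein, $year) = @$pair;
        push @lines, join("\t", $protein, $year, $scores{$protein}{$year});
    }
    return @lines;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $fh, '<', $file) or die "Can't open \"$file\": $!\n";
    print "$_\n" for sum_by_group($fh);
    close $fh;
}
```

Run against the sample input from the question, this should print the four expected lines, with b2 1974 summed to 0.7.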