Grouping tabular data by multiple columns

Question

I have data in this format:

b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.6
b2  1975    0.3

I want to sum the values in column 3 for rows that share the same values in both columns 1 and 2.

I have written code that sums the values, but I do not know how to print a single line per group.

use strict;
use warnings;
use Data::Dumper;
my $file=shift;
open (DATA, $file);
my %score_by_year;

while ( my $line = <DATA> )
{
        my ($protein, $year, $score) = split /\s+/, $line;
        $score_by_year{$year} +=$score;
        print "$protein\t$year\t$score_by_year{$year}\n";
}
close DATA;

My code gives this output:

b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.7
b2  1975    0.3

whereas the expected output is:

b1  1995    1
b1  2007    0.1
b2  1974    0.7
b2  1975    0.3

Tip: Don't use global variables for file handles, especially not DATA (which already has a meaning). Use lexical variables. Don't use 2-arg open. Check the result of open, because it's a frequent source of failure: open(my $fh, '<', $qfn) or die("Can't open \"$qfn\": $!\n");
– ikegami, Commented May 24, 2019 at 4:36
Tip: split ' ', $line almost always makes more sense than split /\s+/, $line. If your input is tab-separated like your output, though, split /\t/, $line would be the appropriate solution here.
– ikegami, Commented May 24, 2019 at 4:37
Heretical non-Perl approach using the ever-useful GNU datamash: datamash groupby 1,2 sum 3 < input.tsv. (If your real input isn't already sorted the way your sample is, add -s.)
– Shawn, Commented May 24, 2019 at 8:52
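
ikegami's two tips can be sketched together as a small read routine. This is only an illustration: the helper name read_rows and the command-line handling are assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated rows from a file.
sub read_rows {
    my ($qfn) = @_;

    # 3-arg open with a lexical file handle, and check the result.
    open(my $fh, '<', $qfn) or die("Can't open \"$qfn\": $!\n");

    my @rows;
    while (my $line = <$fh>) {
        # split ' ' skips leading whitespace and drops the trailing newline.
        my @fields = split ' ', $line;
        next if !@fields;    # skip blank lines
        push @rows, \@fields;
    }
    close($fh);
    return @rows;
}

# Usage: perl script.pl input.txt
if (@ARGV) {
    print join("\t", @$_), "\n" for read_rows($ARGV[0]);
}
```

Because the handle is lexical, it does not clash with the built-in DATA handle, and a missing or unreadable file stops the script with a message instead of silently producing no output.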

Skeeve · Accepted Answer · 2019-05-24 05:38:27Z

1

To keep the original order of the groups, store each protein/year pair the first time you see it:

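use strict;
use warnings;

my @sequence;        # protein/year pairs in order of first appearance
my %scores_by_year;  # $scores_by_year{$protein}{$year} = summed score

while (<DATA>) {
    my ($protein, $year, $score) = split;
    # Remember each protein/year pair the first time it is seen.
    if (not exists $scores_by_year{$protein}{$year}) {
        push @sequence, [$protein, $year];
    }
    $scores_by_year{$protein}{$year} += $score;
}

# Print one line per group, in the original order.
for my $protein_year (@sequence) {
    my ($protein, $year) = @$protein_year;
    print join("\t", $protein, $year, $scores_by_year{$protein}{$year}), "\n";
}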
__DATA__
b1  1995    1
b1  2007    0.1
b2  1974    0.1
b2  1974    0.6
b2  1975    0.3
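
As an alternative sketch, assuming the same three-column input, the two grouping columns can be joined into one composite hash key, so a single-level hash plus an order array is enough. The function name sum_by_protein_year is hypothetical and the sample rows are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: sum column 3 per (protein, year), keeping first-seen order.
sub sum_by_protein_year {
    my @lines = @_;
    my (@order, %sum);
    for my $line (@lines) {
        my ($protein, $year, $score) = split ' ', $line;
        next unless defined $score;      # skip blank or short lines
        my $key = "$protein\t$year";     # composite key built from both columns
        push @order, $key unless exists $sum{$key};
        $sum{$key} += $score;
    }
    # The key already holds "protein<TAB>year", so only the sum is appended.
    return map { "$_\t$sum{$_}" } @order;
}

my @input = (
    "b1  1995    1",
    "b1  2007    0.1",
    "b2  1974    0.1",
    "b2  1974    0.6",
    "b2  1975    0.3",
);
print "$_\n" for sum_by_protein_year(@input);
```

The composite key works here because the fields come from a whitespace split and so cannot contain a tab themselves. If the fields could contain the separator, the nested hash from the answer is the safer choice.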

