Manipulating multiple lines with Perl

Question

I have a file of a few hundred lines of the form

1st  2n  2p  3n  3p  4n  4p
1ABJa  2  20  8  40  3  45
1ABJb  2  40  8  80  3  45
2C3Da  4  50  5  39  2  90
2D4Da  1  10  8  90  8  65

(tab separated file)

From this file, I want to take all lines that share the same first four characters in the 1st column (e.g. 1ABJa and 1ABJb) and do the following:

for column 1 merge both names maintaining the common characters;
for columns 2n, 3n, 4n... the numbers would be summed;
for columns 2p, 3p, 4p, ... the numbers would be averaged.

(note that this can be specified by column position and not name). This would then yield:

1st  2n  2p  3n  3p  4n  4p
1ABJab  4  30  16  60  6  45       
2C3Da  4  50  5  39  2  90
2D4Da  1  10  8  90  8  65

How would you solve this?

This is probably the most complicated way to do it, but here goes: I am thinking of building an array of all unique 4-character prefixes from the 1st column, then looping over that array to find every row that matches each prefix. If more than one row matches, I would collect their columns and combine them. Here is how far I have got:

#!/usr/local/bin/perl
use strict;
use warnings;
use feature 'say';
use List::MoreUtils qw(uniq);

my $dir='My\\Path\\To\\Directory';
open my $in,"<", "$dir\\my file.txt" or die;
my @uniqarray; my @lines;

#collects unique elements in 1st column and changes them to 4-character words
while (my $line = <$in>) {
    chomp $line;
    @lines= split '\t', $line;
    if (!grep /$lines[0]/, @uniqarray ){
        $lines[0] =~ s/^(.{4}).*/$1/;
        push @uniqarray,$lines[0];
    }
}

my @l;
#for @uniqarray, find all rows in the input that match them. if more than 1 row is found, manipulate the columns
while (my $something=<$in>) {
    chomp $something;
    @l= split '\t', $something;
    if ( map $something =~ m/$_/,@uniqarray){
        **[DO STUFF]**
    }
}

print join "\n", uniq(@uniqarray);

close $in;

In your example output, why is the first row 1ABJab? You haven't specified a rule, so it seems like it could just as easily be 1ABJa.
– ThisSuitIsBlackNot, Commented Apr 1, 2014 at 15:19
I gave it the name 1ABJab because it contains data from both 1ABJa and 1ABJb, and I want to distinguish it from the other rows. I will add the rule for this. Thanks!
– Sos, Commented Apr 1, 2014 at 15:21
The only hard part is putting together the end result, since from the expected output it looks like the merged results are combined, after the fact, with the lines that weren't merged.
– user557597, Commented Apr 1, 2014 at 15:25
Never mind, I read that too fast... I was thinking you used the name of just one of the rows (e.g. 1ABJb), not a combination.
– ThisSuitIsBlackNot, Commented Apr 1, 2014 at 15:25
'D:\' is incorrect code, the backslash will escape your closing quote. Which is quite visible in the Markdown formatting above.
– TLP, Commented Apr 1, 2014 at 15:26
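
The quoting problem raised in the last comment can be sketched as follows. In a single-quoted Perl string a backslash still escapes a following backslash or single quote, so a literal that ends in one backslash swallows its closing quote. The directory and file names below are placeholders for illustration, not the asker's real paths.

```perl
use strict;
use warnings;
use File::Spec;

# my $bad = 'D:\';          # syntax error: \' is an escaped quote, so the string never ends
my $drive = 'D:\\';         # two backslashes in the source give one in the string
my $dir   = 'D:/Some/Dir';  # hypothetical path; forward slashes also work on Windows

# catfile joins the parts with the separator of the OS the script runs on.
my $path = File::Spec->catfile($dir, 'my file.txt');

print length($drive), "\n"; # 3 characters: D, colon, backslash
print "$path\n";
```

Writing paths with forward slashes or building them with File::Spec avoids the escaping question altogether.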

Toto · Accepted Answer · 2014-04-01 16:15:30Z

2

How about:

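use strict;
use warnings;

# Reads from the DATA handle, i.e. the lines after __DATA__ at the end of the script.
my $result;
my $head = <DATA>;                       # header line, printed unchanged below
while (<DATA>) {
    chomp;
    my @l = split /\s+/;
    # $k1 = four-character prefix (group key), $k2 = remainder of the name
    my ($k1, $k2) = ($l[0] =~ /^(....)(.*)$/);
    $result->{$k1}{more} .= $k2 // '';
    $result->{$k1}{nbr}++;               # number of rows in this group

    # n columns are summed; p columns are summed here and averaged on output
    $result->{$k1}{n}{2} += $l[1];
    $result->{$k1}{n}{3} += $l[3];
    $result->{$k1}{n}{4} += $l[5];
    $result->{$k1}{p}{2} += $l[2];
    $result->{$k1}{p}{3} += $l[4];
    $result->{$k1}{p}{4} += $l[6];
}

print $head;
# Hash keys come back in no particular order, so rows may be reordered.
foreach my $k (keys %$result) {
    print $k, $result->{$k}{more}, "\t";
    for my $c (2, 3, 4) {
        printf("%d\t", $result->{$k}{n}{$c});
        if (exists($result->{$k}{nbr}) && $result->{$k}{nbr} != 0) {
            # %d truncates a fractional average
            printf("%d\t", $result->{$k}{p}{$c} / $result->{$k}{nbr});
        } else {
            printf("%d\t", 0);
        }
    }
    print "\n";
}

__DATA__
1st  2n  2p  3n  3p  4n  4p
1ABJa  2  20  8  40  3  45
1ABJb  2  40  8  80  3  45
2C3Da  4  50  5  39  2  90
2D4Da  1  10  8  90  8  65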

output:

1st     2n  2p  3n  3p  4n  4p
2D4Da   1   10  8   90  8   65  
1ABJab  4   30  16  60  6   45  
2C3Da   4   50  5   39  2   90

answered Apr 1, 2014 at 16:15

Toto


7 Comments

Sos Over a year ago

eheh actually it reminds me of your answer in this thread! I still need to get the hang of these hashes' references. Just a couple of questions (and please redirect me to the documentation if it is simpler, I am trying to learn): 1. $result->{$k1}{n}{2} means $result->{$k1}? 2.What can you do with the names you gave (i.e. n or 2)?

Borodin Over a year ago

@Sosi: I don't follow. Why should you think that $result->{$k1}{n}{2} is the same as $result->{$k1}? And the values used as hash keys, $k1, n and 2, are simple strings. A bareword like n is implicitly quoted when it appears as a hash key. You can do anything with them that you can do with strings.

Borodin Over a year ago

@M42: The OP's "for columns 2n, 3n, 4n... the numbers would be summed" implies to me that there are probably more than the six columns of data shown in the example.

Toto Over a year ago

@Borodin: Yes, I guess you're right, but I leave the rest as an exercise. He just needs to reorganize with some loops.

Sos Over a year ago

@Borodin indeed, I'm using about 40 columns. But I'll see if I can expand from what M42 did!
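
The reorganisation with loops mentioned in these comments could look roughly like the sketch below. It is not the answerer's code: it assumes the columns after the name alternate n, p, n, p as in the sample, and the sub name merge_rows is made up for illustration. Suffixes are concatenated in input order without removing duplicates.

```perl
use strict;
use warnings;

# Hypothetical helper: takes tab-separated data lines (no header, no newline)
# and returns the merged lines in order of first appearance.
sub merge_rows {
    my @lines = @_;
    my (%group, @order);
    for my $line (@lines) {
        my ($name, @vals) = split /\t/, $line;
        # Skip names shorter than four characters
        my ($prefix, $suffix) = $name =~ /^(.{4})(.*)$/ or next;
        push @order, $prefix unless exists $group{$prefix};
        my $g = $group{$prefix} //= { suffix => '', count => 0, sum => [] };
        $g->{suffix} .= $suffix;
        $g->{count}++;
        $g->{sum}[$_] += $vals[$_] for 0 .. $#vals;   # works for any column count
    }
    my @out;
    for my $prefix (@order) {
        my $g    = $group{$prefix};
        my @cols = @{ $g->{sum} };
        # Odd zero-based indices are the p columns: turn sums into averages
        for (my $i = 1; $i < @cols; $i += 2) {
            $cols[$i] /= $g->{count};
        }
        push @out, join "\t", $prefix . $g->{suffix}, @cols;
    }
    return @out;
}

print "$_\n" for merge_rows(
    "1ABJa\t2\t20\t8\t40\t3\t45",
    "1ABJb\t2\t40\t8\t80\t3\t45",
    "2C3Da\t4\t50\t5\t39\t2\t90",
);
```

Because the totals live in an array indexed by column position, nothing in the sub depends on there being exactly six data columns.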


Borodin · Accepted Answer · 2014-04-01 16:44:59Z

1

This appears to do what you need. It keeps a set of data in a hash for each distinct four-character prefix: a count of the number of records with the same prefix under key n, an array that holds the column totals for that prefix under key totals, and a hash with all the suffixes seen for that prefix under key suffixes.

Prefixes are added to the array @prefixes the first time they are seen, so that the output can be presented in the same order as the input.

It is simply a matter of accumulating the data and then dumping it in the required format, after dividing every second column of the totals array (the p columns, at odd zero-based indices) by n.

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die $!;

print scalar <$fh>; # Copy header

my %data;
my @prefixes;

while (<$fh>) {
  chomp;
  my @fields = split /\t/;
  my ($prefix, $suffix) = shift(@fields) =~ /(.{4})(.*)/;
  push @prefixes, $prefix unless $data{$prefix};
  ++$data{$prefix}{n};
  ++$data{$prefix}{suffixes}{$suffix};
  $data{$prefix}{totals}[$_] += $fields[$_] for 0 .. $#fields;
}

for my $prefix (@prefixes) {
  my $val      = $data{$prefix};
  my $totals   = $val->{totals};
  for (my $i = 1; $i < @$totals; $i += 2) {
    $totals->[$i] /= $val->{n};
  }
  my $suffixes = join '', sort keys %{ $val->{suffixes} };
  print join("\t", "$prefix$suffixes", @$totals), "\n";
}

output

1st     2n  2p  3n  3p  4n  4p
1ABJab  4   30  16  60  6   45
2C3Da   4   50  5   39  2   90
2D4Da   1   10  8   90  8   65
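
To make the per-prefix record described in this answer concrete, the sketch below accumulates just the two 1ABJ rows from the question and prints what is stored before any averaging. The input lines are hard-coded here as an assumption for illustration; the accumulation statements mirror the loop in the answer.

```perl
use strict;
use warnings;

my %data;
for my $line ("1ABJa\t2\t20\t8\t40\t3\t45", "1ABJb\t2\t40\t8\t80\t3\t45") {
    my @fields = split /\t/, $line;
    my ($prefix, $suffix) = shift(@fields) =~ /(.{4})(.*)/;
    ++$data{$prefix}{n};                    # rows seen for this prefix
    ++$data{$prefix}{suffixes}{$suffix};    # set of suffixes, stored as hash keys
    $data{$prefix}{totals}[$_] += $fields[$_] for 0 .. $#fields;
}

my $rec = $data{'1ABJ'};
print "n        = $rec->{n}\n";                               # 2
print "suffixes = @{[ sort keys %{ $rec->{suffixes} } ]}\n";  # a b
print "totals   = @{ $rec->{totals} }\n";                     # 4 60 16 120 6 90
```

Dividing the totals at indices 1, 3 and 5 by n then gives 30, 60 and 45, which matches the 1ABJab row in the output.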

edited Apr 1, 2014 at 16:44

answered Apr 1, 2014 at 16:39

Borodin


4 Comments

Sos Over a year ago

wow this is really elegant! I need to have a good look at those array refs that you used tomorrow! thanks

Sos Over a year ago

ok, now I got it! thanks for the help! It works perfectly even in my real case

Sos Over a year ago

++$data{$prefix}{n} could also be $data{$prefix}{n}++, or when creating that hash value do you need it to be incremented before? I know that -> has operator precedence over ++ so both ways are equivalent right?

Borodin Over a year ago

@Sosi: Yes, the two are equivalent. The only reason people use the postfixed $x++ by default when it doesn't matter is because of the name of the language C++. It's useful to reverse that habit as there are some languages that do the extra work to preserve the pre-increment value even when it isn't used.
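
A small sketch of the equivalence discussed in these comments, using made-up hashes %pre and %post: as a standalone statement either form creates the nested entry and leaves it at 1, and the two differ only when the value of the increment expression itself is used.

```perl
use strict;
use warnings;

my (%pre, %post);

# Neither key exists yet; both statements autovivify the nested hash
# and leave the counter at 1.
++$pre{'1ABJ'}{n};
$post{'1ABJ'}{n}++;

# The forms differ only when the result of the expression is used.
my $x = 5;
my $from_post = $x++;   # receives 5, then $x becomes 6
my $from_pre  = ++$x;   # $x becomes 7, receives 7

print "$pre{'1ABJ'}{n} $post{'1ABJ'}{n} $from_post $from_pre\n";   # 1 1 5 7
```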
