
I have a file of a few hundred lines of the form

1st  2n  2p  3n  3p  4n  4p
1ABJa  2  20  8  40  3  45
1ABJb  2  40  8  80  3  45
2C3Da  4  50  5  39  2  90
2D4Da  1  10  8  90  8  65

(tab separated file)

From this file, I want to take all lines that share the same first four characters in the 1st column (e.g. 1ABJa and 1ABJb) and do the following:

  • for column 1 merge both names maintaining the common characters;
  • for columns 2n, 3n, 4n... the numbers would be summed;
  • for columns 2p, 3p, 4p, ... the numbers would be averaged.

(Note that the columns can be identified by position rather than by name.) This would then yield:

1st  2n  2p  3n  3p  4n  4p
1ABJab  4  30  16  60  6  45       
2C3Da  4  50  5  39  2  90
2D4Da  1  10  8  90  8  65

How would you solve this?

This is probably the most complicated way to do this, but here it goes: I am thinking about creating an array of all unique 4-character prefixes from the 1st column. Then, for each element of that array, running a loop that finds all rows matching those 4 characters. If there is more than one match, identify the rows, push their columns, and manipulate them. Here's what I have so far:

#!/usr/local/bin/perl
use strict;
use warnings;
use feature 'say';
use List::MoreUtils qw(uniq);

my $dir='My\\Path\\To\\Directory';
open my $in,"<", "$dir\\my file.txt" or die;
my @uniqarray; my @lines;

#collects unique elements in 1st column and changes them to 4-character words
while (my $line = <$in>) {
    chomp $line;
    @lines= split '\t', $line;
    if (!grep /$lines[0]/, @uniqarray ){
        $lines[0] =~ s/^(.{4}).*/$1/;
        push @uniqarray,$lines[0];
    }
}

my @l;
#for @uniqarray, find all rows in the input that match them. if more than 1 row is found, manipulate the columns
while (my $something=<$in>) {
    chomp $something;
    @l= split '\t', $something;
    if ( map $something =~ m/$_/,@uniqarray){
        **[DO STUFF]**
    }
}

print join "\n", uniq(@uniqarray);

close $in;
  • In your example output, why is the first row 1ABJab? You haven't specified a rule, so it seems like it could just as easily be 1ABJa. Commented Apr 1, 2014 at 15:19
  • I gave it the name 1ABJab because it contains data from both 1ABJa and 1ABJb, and I want to distinguish it from the other rows. I will add the rule for this. Thanks! Commented Apr 1, 2014 at 15:21
  • The only hard part is putting together the end result, since it looks as though, after the fact, the merged results are combined with the lines that weren't analyzed. Commented Apr 1, 2014 at 15:25
  • Never mind, I read that too fast... I was thinking you used the name of just one of the rows (e.g. 1ABJb), not a combination. Commented Apr 1, 2014 at 15:25
  • 'D:\' is incorrect code, the backslash will escape your closing quote. Which is quite visible in the Markdown formatting above. Commented Apr 1, 2014 at 15:26

2 Answers


How about:

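my $result;
my $head = <DATA>;    # header line, printed unchanged later
while(<DATA>) {
    chomp;
    my @l = split/\s+/;
    # Split the name into its 4-character prefix and the remaining suffix.
    my ($k1,$k2) = ($l[0] =~ /^(....)(.*)$/);
    $result->{$k1}{more} .= $k2 // '';    # collect the suffixes seen for this prefix
    $result->{$k1}{nbr}++;                # number of rows sharing this prefix

    # Running totals per column; the p totals are divided by nbr when printing.
    $result->{$k1}{n}{2} += $l[1];
    $result->{$k1}{n}{3} += $l[3];
    $result->{$k1}{n}{4} += $l[5];
    $result->{$k1}{p}{2} += $l[2];
    $result->{$k1}{p}{3} += $l[4];
    $result->{$k1}{p}{4} += $l[6];
}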

print $head;
foreach my $k (keys %$result) {
    print $k,$result->{$k}{more},"\t";
    for my $c (2,3,4) {
        printf("%d\t",$result->{$k}{n}{$c});
        if (exists($result->{$k}{nbr}) && $result->{$k}{nbr} != 0) {
            printf("%d\t",$result->{$k}{p}{$c}/$result->{$k}{nbr});
        } else {
            printf("%d\t",0);
        }
    }
    print "\n";
}

output:

1st     2n  2p  3n  3p  4n  4p
2D4Da   1   10  8   90  8   65  
1ABJab  4   30  16  60  6   45  
2C3Da   4   50  5   39  2   90  
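
The same accumulate-then-print idea can be sketched for any number of n/p column pairs by looping over column positions instead of naming each one. This is only a sketch under assumptions that are not part of the answer above: the input is tab-separated, the columns after the name alternate sum, average, sum, average, and the sample rows are inlined in an array. The names `@rows`, `%acc`, `@order` and `@out` are hypothetical. It also records the order in which prefixes first appear, so the output follows the input rather than hash order.

```perl
use strict;
use warnings;

# Sample rows from the question, inlined so the sketch is self-contained.
my @rows = (
    "1ABJa\t2\t20\t8\t40\t3\t45",
    "1ABJb\t2\t40\t8\t80\t3\t45",
    "2C3Da\t4\t50\t5\t39\t2\t90",
    "2D4Da\t1\t10\t8\t90\t8\t65",
);

my (%acc, @order);
for my $row (@rows) {
    my ($name, @cols) = split /\t/, $row;
    my ($prefix, $suffix) = $name =~ /^(.{4})(.*)$/;
    push @order, $prefix unless exists $acc{$prefix};    # remember first-seen order
    $acc{$prefix}{more} .= $suffix;                      # collect suffixes
    $acc{$prefix}{nbr}++;                                # rows in this group
    $acc{$prefix}{sum}[$_] += $cols[$_] for 0 .. $#cols; # running total per column
}

my @out;
for my $prefix (@order) {
    my $g    = $acc{$prefix};
    my @vals = @{ $g->{sum} };
    # Assumption: every second column (indices 1, 3, 5, ...) is averaged.
    for (my $i = 1; $i < @vals; $i += 2) {
        $vals[$i] /= $g->{nbr};
    }
    push @out, join "\t", $prefix . $g->{more}, @vals;
}
print "$_\n" for @out;
```

With about 40 columns, as mentioned in the comments, nothing in the loop would need to change as long as the sum/average alternation holds.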

7 Comments

eheh actually it reminds me of your answer in this thread! I still need to get the hang of these hash references. Just a couple of questions (and please redirect me to the documentation if it is simpler, I am trying to learn): 1. Does $result->{$k1}{n}{2} mean the same as $result->{$k1}? 2. What can you do with the names you gave (i.e. n or 2)?
@Sosi: I don't follow. Why should you think that $result->{$k1}{n}{2} is the same as $result->{$k1}? And the values used as hash keys, $k1, n and 2, are simple strings. A bareword like n is implicitly quoted when it appears as a hash key. You can do anything with them that you can do with strings.
@M42: The OP's "for columns 2n, 3n, 4n... the numbers would be summed" implies to me that there are probably more than the six columns of data shown in the example
@Borodin: Yes, I guess you're right, but I leave the rest as an exercise. He just needs to reorganize with some loops.
@Borodin indeed, I'm using about 40 columns. But I'll see if I can expand from what M42 did!
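
On the hash-key question raised in the comments above, a tiny sketch may make the distinction concrete. The values are made up for illustration; only the `$result->{$k1}{n}{2}` shape comes from the answer.

```perl
use strict;
use warnings;

my $result;
my $k1 = '1ABJ';

# Assigning through nested keys autovivifies each intermediate hash.
$result->{$k1}{n}{2} += 4;
$result->{$k1}{p}{2} += 60;

# $result->{$k1} is a reference to the inner hash, not a number.
my $inner = $result->{$k1};
print ref($inner), "\n";                     # HASH
print join(',', sort keys %$inner), "\n";    # n,p

# The bareword n is the same key as the string 'n'; 2 is the same key as '2'.
print $result->{$k1}{'n'}{'2'}, "\n";        # 4
```

So `$result->{$k1}` is the whole record for one prefix, while `$result->{$k1}{n}{2}` is a single number stored two levels further down.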

This appears to do what you need. It keeps a set of data in a hash for each distinct four-character prefix: a count of the number of records with the same prefix under key n, an array that holds the column totals for that prefix under key totals, and a hash with all the suffixes seen for that prefix under key suffixes.

Prefixes are added to the array @prefixes the first time they are seen, so that the output can be presented in the same order as the input.

It is simply a matter of accumulating the data and then dumping it in the required format, after dividing all the even-numbered columns (the odd indices of the totals array) by n.
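
To make the layout concrete, the entry held for the prefix 1ABJ would look roughly like this once both sample rows have been read and before the averages are computed. The literal is written by hand to illustrate the structure described above; it is not output captured from the program, and `$rec` and `$name` are names introduced only for this sketch.

```perl
use strict;
use warnings;

# Hand-written illustration of one entry of %data after reading
# the rows 1ABJa and 1ABJb (before dividing the averaged columns).
my %data = (
    '1ABJ' => {
        n        => 2,                          # rows seen with this prefix
        suffixes => { a => 1, b => 1 },         # each suffix and how often it occurred
        totals   => [ 4, 60, 16, 120, 6, 90 ],  # per-column running sums
    },
);

my $rec  = $data{'1ABJ'};
my $name = '1ABJ' . join '', sort keys %{ $rec->{suffixes} };
print "$name\n";                                # 1ABJab
print $rec->{totals}[1] / $rec->{n}, "\n";      # 30
```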

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die $!;

print scalar <$fh>; # Copy header

my %data;
my @prefixes;

while (<$fh>) {
  chomp;
  my @fields = split /\t/;
  my ($prefix, $suffix) = shift(@fields) =~ /(.{4})(.*)/;
  push @prefixes, $prefix unless $data{$prefix};
  ++$data{$prefix}{n};
  ++$data{$prefix}{suffixes}{$suffix};
  $data{$prefix}{totals}[$_] += $fields[$_] for 0 .. $#fields;
}

for my $prefix (@prefixes) {
  my $val      = $data{$prefix};
  my $totals   = $val->{totals};
  for (my $i = 1; $i < @$totals; $i += 2) {
    $totals->[$i] /= $val->{n};
  }
  my $suffixes = join '', sort keys %{ $val->{suffixes} };
  print join("\t", "$prefix$suffixes", @$totals), "\n";
}

output

1st     2n  2p  3n  3p  4n  4p
1ABJab  4   30  16  60  6   45
2C3Da   4   50  5   39  2   90
2D4Da   1   10  8   90  8   65

4 Comments

wow this is really elegant! I need to have a good look at those array refs that you used tomorrow! thanks
ok, now I got it! thanks for the help! It works perfectly even in my real case
++$data{$prefix}{n} could also be $data{$prefix}{n}++, or when creating that hash value do you need it to be incremented before? I know that -> has operator precedence over ++ so both ways are equivalent right?
@Sosi: Yes, the two are equivalent. The only reason people use the postfixed $x++ by default when it doesn't matter is because of the name of the language C++. It's useful to reverse that habit as there are some languages that do the extra work to preserve the pre-increment value even when it isn't used.
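
A short sketch of the equivalence discussed in the last two comments, using made-up keys `pre` and `post`: as standalone statements the two forms have the same effect, and they differ only when the value of the expression is used.

```perl
use strict;
use warnings;

my %data;
++$data{pre}{n};     # prefix form
$data{post}{n}++;    # postfix form

# As standalone statements both autovivify the entry and leave it at 1.
print "$data{pre}{n} $data{post}{n}\n";    # 1 1

# The forms differ only when the expression's value is used.
my $x   = 5;
my $old = $x++;      # $old gets the value before the increment
my $new = ++$x;      # $new gets the value after the increment
print "$old $new\n";                       # 5 7
```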
