I want to read a file and store its contents in a hash of arrays, using the first column of each row as the key. Then I want to read the files in a directory and append each file name to the end of the array for the matching key.

File ($file_info):

AANB    John    male
S00V    Sara    female
SBBA    Anna    female

files in the directory:

AANB.txt
SBBA.txt
S00V.txt

expected output:

AANB    John    male    AANB.txt
S00V    Sara    female  S00V.txt
SBBA    Anna    female  SBBA.txt

Here's the script itself:

#!/usr/bin/perl

use strict;
use warnings;

my %all_samples=();
my $file_info = $ARGV[0];

open(FH, "<$file_info");

while(<FH>) {
    chomp;
    my @line = split("\t| ", $_);

    push(@{$all_samples{$line[0]}}, $_);
}

my $dir = ".";
opendir(DIR, $dir);
my @files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);

foreach my $file (@files) {
    foreach my $k (keys %all_samples){
        foreach my $element (@{ $all_samples{$k} }){
            my @element = split(' ', $element);
            if ($file =~ m/$element[0]/) {
                push @{$all_samples{$element}}, $file;
            }
            else {
                next;
            }
        }
    }

}

foreach my $k (keys %all_samples) {
    foreach my $element (@{ $all_samples{$k} }) {
        print $element,"\n";
    }
}

But the output is not what I expected:

AANB    John    male
SBBA.txt1
S00V    Sara    female
SBBA    Anna    female
S00V.txt1
AANB.txt1

2 Answers

I think that

        my @element = split(' ', $element);
        if ($file =~ m/$element[0]/) {
            push @{$all_samples{$element}}, $file;
        }

is not doing the right thing: the push uses $element (the whole line) as the hash key rather than $k, so $all_samples{$element} is a new arrayref. You're printing six one-element arrays rather than three two-element arrays.

But then it doesn't help that you're printing the array elements one per line.

I think that your final section should look more like this:

foreach my $k (keys %all_samples) {
    print join( "\t", @{ $all_samples{$k} } ) . "\n"
}
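
To make the difference concrete, here is a minimal sketch of the original nested loop with only the hash key of the push changed. The rows and file names are inlined from the question so that it runs without input files; the quoted, anchored regex is my own addition and not something the original script used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows and file names from the question, inlined so the sketch
# runs without any input files (an assumption made for illustration).
my @rows  = ( "AANB\tJohn\tmale", "S00V\tSara\tfemale", "SBBA\tAnna\tfemale" );
my @files = ( 'AANB.txt', 'SBBA.txt', 'S00V.txt' );

my %all_samples;
for my $row (@rows) {
    my ($key) = split /\s+/, $row;
    push @{ $all_samples{$key} }, $row;
}

for my $file (@files) {
    for my $k ( keys %all_samples ) {
        # Push onto the existing key $k, not onto a key made from the whole row.
        push @{ $all_samples{$k} }, $file if $file =~ /^\Q$k\E\./;
    }
}

# One line per key: the original row followed by the matching file name.
my @output = map { join( "\t", @{ $all_samples{$_} } ) } sort keys %all_samples;
print "$_\n" for @output;
```

Sorting the keys is only there to make the output order predictable; hash order in Perl is otherwise not guaranteed.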

In general, I think that you're overcomplicating this script. Here's how I would write it:

#!/usr/bin/perl

use strict;
use warnings;

my $all_samples={};

while(<>) {
    chomp;
    # Note that I'm using variable names here to document
    # the format of the file being read. This makes for
    # easier troubleshooting: if a column is missing,
    # it's easier to tell that $file_base_name shouldn't be
    # 'Anna' than that $line[0] should not be 'Anna'.
    my ( $file_base_name, $given_name, $sex ) = split("\t", $_);
    push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
}

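my $dir = ".";
opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
my @files = grep(/\.txt$/, readdir($dh));
closedir($dh);

FILE: foreach my $file (@files) {
    BASE: foreach my $base (keys %{$all_samples}){
        # Quote $base and anchor the match so that a key only matches
        # file names that start with it, followed by a dot.
        next BASE unless( $file =~ /^\Q$base\E\./ );
        push @{$all_samples->{$base}}, $file;
    }
}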

foreach my $k (keys %{$all_samples} ) {
    print join( "\t", @{ $all_samples->{$k} } ) . "\n";
}

I prefer hashrefs to hashes simply because I tend to deal with nested data structures, so I'm more used to seeing $all_samples->{$k} than $all_samples{$k}. More importantly, I'm using the full power of the arrayref, meaning that I don't have to re-split a line that has already been split once.

G. Cito brings up an interesting point: why did I use

push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );

rather than

push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );

There's nothing syntactically wrong with the latter, but it wasn't what I was trying to accomplish:

Let's look at what $all_samples->{$base} would look like after

push @{$all_samples->{$base}}, $file;

If the original push had been this:

push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );

@{$all_samples->{$base}} would look like this:

(
    [ $file_base_name, $given_name, $sex ],
    $file
)

If instead, we use

push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );

@{$all_samples->{$base}} looks like this after push @{$all_samples->{$base}}, $file:

(
    $file_base_name, 
    $given_name, 
    $sex, 
    $file
)

For instance:

(
    "AANB",
    "John",   
    "male",    
    "AANB.txt"
)

So printing the array with

print join( "\t", @{ $all_samples->{$k} } ) . "\n";

will print

AANB    John    male    AANB.txt
1 Comment

Nice self documenting with variable names. Good practice (mea culpa!). Possible typo? don't you want [ $file_base_name, $given_name, $sex ] rather than ( $file_base_name, $given_name, $sex ) ?
Here is a somewhat simpler way of creating the hash of arrays. It reads from DATA instead of a file only for convenience:

#!perl
use strict;
use warnings;
use Data::Dumper;

my $samples;

while (<DATA>) {
    chomp;
    # The first column is the key; the remaining columns become the array.
    my ( $key, @fields ) = split /\s+/;
    $samples->{$key} = \@fields;
}

# Append the file name built from each key.
push @{ $samples->{$_} }, "$_.txt" for keys %$samples;

print Dumper $samples;

__DATA__
AANB    John    male
S00V    Sara    female
SBBA    Anna    female

Since the filenames are known, you can just construct them from the keys. Or is that not possible? Confirming that each file exists with a file test (see perldoc -f -X) before pushing it onto the array would avoid creating bad data while still letting you build the entries this way.
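
As a rough sketch of that file-test idea: the version below only appends a name when -e says the file is really there. The temporary directory, and the fact that S00V.txt is deliberately left out, are assumptions made up for the demonstration; in the real script you would test against the directory that holds the .txt files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Throwaway directory holding only two of the three expected files,
# so the -e check has something to reject (illustrative setup only).
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(AANB.txt SBBA.txt)) {
    my $path = File::Spec->catfile( $dir, $name );
    open( my $fh, '>', $path ) or die "Cannot create $path: $!";
    close $fh;
}

my $samples = {
    AANB => [ 'John', 'male' ],
    S00V => [ 'Sara', 'female' ],
    SBBA => [ 'Anna', 'female' ],
};

for my $key ( sort keys %$samples ) {
    my $path = File::Spec->catfile( $dir, "$key.txt" );
    # -e is true only if the path exists; skip names with no file behind them.
    push @{ $samples->{$key} }, "$key.txt" if -e $path;
}

print join( "\t", $_, @{ $samples->{$_} } ), "\n" for sort keys %$samples;
```

Here the S00V row keeps only its two original fields, because no S00V.txt exists in the test directory.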
