Perl Scripting | using HASHes

Question

I have an input file which looks like below

 =IP1
abc[0]
abc[1]
abc[2]
=IP2
def[4]
def[8]
def[9]

I need to get the output in the below format -

=IP1
abc[0-2]
=IP2
def[4,8-9]

I have been trying to achieve the above using hashes where I read each line of the file and then split(with'[') each line, I keep the first part as key and read the file again to keep the values in an array for the hash keys. But I am getting stuck in a loop. Can anyone provide help on how to achieve the above ?

Do all the values after each = line start with the same string? For instance, do all the lines after =IP1 always start with abc or can there be a mixture? And are the numbers in brackets always in increasing order? — Borodin
– Borodin, Commented Aug 3, 2015 at 4:08
No, values after =, start with a different name. There is a mixture. Secondly Yes, the numbers inside the brackets are always in the increasing order. — Ravi Raj
– Ravi Raj, Commented Aug 3, 2015 at 8:09
Okay, then are the same values all grouped together -- say, all the abcs and then all the xyzs and then =IP2. And are the numbers in brackets always in sorted order? — Borodin
– Borodin, Commented Aug 3, 2015 at 8:13
And can different groups have the same tags? For instance, can both =IP3 and =IP4 have xyz values? — Borodin
– Borodin, Commented Aug 3, 2015 at 8:15
Yes , all the abc s and all the xyz s are grouped together. Yes the numbers in the brackets are always in sorted order. Different groups don't have the same tags. — Ravi Raj
– Ravi Raj, Commented Aug 3, 2015 at 8:53

Community · Accepted Answer · 2017-05-23 10:26:49Z

There are several interesting subproblems. First, you want to keep track of the most recent header (ie, =IP1). Second, you want to keep track of lists of numbers that are associated with some keys, and third, you want to generate range strings.

Here's how I would do it:

#!/usr/bin/env perl

use strict;
use warnings;

my $tl;
my %h;

# First process the lines of the input file.
while(<DATA>) {
    chomp;
    next unless length;
    if(/^(=\w{2}\d+)$/) { # Recognize and track a top level heading.
        $tl = $1;
        next;
    }
    if(/^(\w+)\[(\d+)\]$/) {  # Or grab a key/value pair.
        my($k,$v) = ($1,$2);
        push @{$h{$tl}{$k}}, $v; # push the value into the right bucket.
        next;
    }
    warn "Unrecognized format cannot be processed at $.: (($_))\n";
}

# Sort the top level headers alphabetically and numerically.
# Uses a Schwartzian Transform so that we don't need to recompute
# sort keys repeatedly.
my @topkeys = map  {$_->[0]}
              sort {$a->[1] cmp $b->[1] || $a->[2] <=> $b->[2]} 
              map  {
                my($alpha, $num) = $_ =~ m/^=(\w+)(\d+)$/; 
                [$_, $alpha, $num]
              } keys %h;

# Now iterate through the structure in sorted order, generate range
# strings on the fly, and print our output.
foreach my $top (@topkeys) {
    print "$top\n";
    foreach my $k (sort keys %{$h{$top}}) {
        my @vl = sort {$a <=> $b} @{$h{$top}{$k}};
        my $range = num2range(@vl);
        print "$k\[$range]\n";
    }
}

sub num2range {
  local $_ = join ',' => @_;
  s/(?<!\d)(\d+)(?:,((??{$++1}))(?!\d))+/$1-$+/g;
  return $_;
}

__DATA__
=IP1
abc[0]
abc[1]
abc[2]
=IP2
def[4]
def[8]
def[9]

The following output is produced:

=IP1
abc[0-2]
=IP2
def[4,8-9]

This solution could be optimized further if answers to some of the questions that Borodin asked as a comment to your original post were answered. For example, it would be unnecessary to sort our number list before generating a range if we knew that the numbers were already in order. And some complexity (and computational work) might be eliminated if we knew more about what "abc" and "def" are. And if sorted order doesn't matter, we could simplify further while also reducing the amount of work being done.

Also, the Set::IntSpan module could probably provide a more robust approach to generate a range string, and is probably worth considering if this script is intended to live beyond the "one off" lifespan. If you choose to use Set::IntSpan your num2range sub could look like this:

sub num2range{ return Set::IntSpan->new(@_) }

The Set::IntSpan object has overloaded stringification, so printing it gives a text representation of the range. If you went this route, you could eliminate the code that sorts the lists of numbers -- that's handled by Set::IntSpan internally.

Thanks a ton! Well I am pretty new to PERL, so can there be a bit easier lines of code for achieving the same?
@RaviRaj: It's Perl ! And in general it's a mistake to invent data that you think behaves the same way as your real data; you usually end up with people asking a lot of questions (as I had to) or the OP has to reject perfectly good answers with "thanks, but it's not quite like that"

Borodin · Accepted Answer · 2015-08-03 08:57:22Z

Okay, here's my take on a solution. Without any better information on the incoming data it may be more complicated than necessary

It keeps the data -- both the =IP headers and the xyz[9] values in the same order that they are first encountered. I've separated out the generation of the number range contraction to subroutine ranges

It's simply a matter of reading the data in from the file -- which it expects as a parameter on the command line -- into data structures %data and @order and printing them out again. The @order array and the _order subkey of the hash are there to preserve the sequence that the values are encountered and are added to whenever a new key is inserted into the corresponding hash

use strict;
use warnings;

my ($key, %data, @order);

while ( <> ) {

  chomp;

  if ( /^=/ ) {
    $key = $_;
    push @order, $key unless $data{$key};
    $data{$key} = { _order => [] };
  }
  elsif ( my ($key2, $n) = /([^\[\]\s]+)\[(\d+)\]/ ) {
    my $data = $data{$key};
    push @{ $data->{_order} }, $key2 unless $data->{$key2};
    push @{ $data->{$key2} }, $n; 
  }
}

for my $key ( @order ) {

    print $key, "\n";

    my $data = $data{$key};

    for my $key2 ( @{ $data->{_order} } ) {
        printf "%s[%s]\n", $key2, ranges( sort { $a <=> $b } @{ $data->{$key2} } );
    }

}

sub ranges {

    my @ranges;
    my ($start, $end);

    for my $n ( @_ ) {
        if ( not defined $start ) {
            $start = $end = $n;
        }
        elsif ( $n == $end + 1 ) {
            $end = $n;
        }
        else {
            push @ranges, $start == $end ? $start : "$start-$end";
            $start = $end = $n;
        }
    }

    push @ranges, $start == $end ? $start : "$start-$end" if defined $start;
    join ',', @ranges;
}

output

=IP1
abc[0-2]
=IP2
def[4,8-9]

Collectives™ on Stack Overflow

Perl Scripting | using HASHes

2 Answers 2

2 Comments

output

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

output

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related