Parsing a structured text file in Perl

Question

I'm quite new to Perl and I'm having immense difficulty writing a Perl script that will successfully parse a structured text file.

I have a collection of files that look like this:

name:
    John Smith
occupation:
    Electrician
date of birth:
    2/6/1961
hobbies:
    Boating
    Camping
    Fishing

And so on. The field name is always followed by a colon, and all the data associated with those fields is always indented by a single tab (\t).

I would like to create a hash that will directly associate the field contents with the field name, like this:

 $contents{name} = "John Smith"
 $contents{hobbies} = "Boating, Camping, Fishing"

Or something along those lines.

So far I've been able to get all the field names into a hash by themselves, but I've not had any luck wrangling the field data into a form that can be nicely stored in a hash. Clearly substituting/splitting on newlines followed by tabs won't work (I've tried, somewhat naively). I've also tried a crude lookahead where I create a duplicate array of lines from the file and use that to figure out where the field boundaries are, but it's not that great in terms of memory consumption.

FWIW, currently I'm going through the file line by line, but I'm not entirely convinced that this is the best solution. Is there any way to do this parsing in a straightforward manner?

Miller · Accepted Answer · 2014-10-11 16:43:15Z

Reading the file line by line is a good way to go. Here I am creating a hash of array references. This shows how to read just one file. You could read each file this way and store each file's hash of arrays in an outer hash, giving a hash of hashes of arrays.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %contents;
my $key;
while(<DATA>){
    chomp;
    if ( s/:\s*$// ) {
        $key = $_;
    } else {
        s/^\s+//; # remove leading whitespace only
        push @{$contents{$key}}, $_;
    }
}
print Dumper \%contents;

__DATA__
name:
    John Smith
occupation:
    Electrician
date of birth:
    2/6/1961
hobbies:
    Boating
    Camping
    Fishing

Output:

$VAR1 = {
          'occupation' => [
                             'Electrician'
                           ],
          'hobbies' => [
                          'Boating',
                          'Camping',
                          'Fishing'
                        ],
          'name' => [
                       'John Smith'
                     ],
          'date of birth' => [
                                '2/6/1961'
                              ]
        };
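
The question asked for one string per field and mentioned a collection of files. Here is a minimal sketch of both, assuming the same layout as above. The `parse_record` and `flatten` helpers, the in-memory sample, and the `sample.txt` key are illustrative names, not part of the original answer:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one record from an open filehandle
# and return a reference to a hash of arrays.
sub parse_record {
    my ($fh) = @_;
    my ( %contents, $key );
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /\S/;           # skip blank lines
        if ( $line =~ /^(\S.*?):\s*$/ ) {    # unindented "field name:" line
            $key = $1;
        }
        elsif ( defined $key ) {             # indented data line
            $line =~ s/^\s+|\s+$//g;         # trim both ends, keep inner spaces
            push @{ $contents{$key} }, $line;
        }
    }
    return \%contents;
}

# Hypothetical helper: join each field's values into a single string.
sub flatten {
    my ($record) = @_;
    my %flat;
    $flat{$_} = join ', ', @{ $record->{$_} } for keys %$record;
    return \%flat;
}

# An in-memory string stands in for a file here. For real files, open
# each path and store the result under a key such as the file name.
my $sample = "name:\n\tJohn Smith\nhobbies:\n\tBoating\n\tCamping\n\tFishing\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my %people = ( 'sample.txt' => flatten( parse_record($fh) ) );
close $fh;

print "$people{'sample.txt'}{hobbies}\n";    # Boating, Camping, Fishing
```

Keeping the arrays and joining only at the end means the individual values stay available if a field's entries ever need to be processed separately.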

edited Oct 11, 2014 at 16:43

Miller

answered Oct 11, 2014 at 15:12

hmatt1

3 Comments

i alarmed alien Over a year ago

Probably best not to remove all the extra whitespace with s/\s+//g; - it's useful in names! ;)

hmatt1 Over a year ago

@ialarmedalien looks like Miller updated it to leading whitespace. Definitely screws up names, good call. Just wanted to throw something there to show where you could do processing on the elements if you needed to!

MARS Over a year ago

This is exactly what I needed - it's so much easier than I thought it would be. Thanks!

amon · Accepted Answer · 2014-10-27 08:53:08Z

This text file is actually quite close to YAML, and it's not difficult to convert it into a valid YAML file: each indented data line only needs to become a list item, i.e. start with "- ".

Once you have a YAML file, you can use YAML::Tiny or another module to parse it, which leads to cleaner code:

#!/usr/bin/perl
use strict;
use warnings;

use YAML::Tiny;
use Data::Dumper;

convert( './data.yaml', 'output.yaml' );
parse('output.yaml');

sub parse {
    my $yaml    = shift;
    my $yamlobj = YAML::Tiny->read($yaml);

    my $name    = $yamlobj->[0]->{name}[0];
    my $occ     = $yamlobj->[0]{occupation}[0];
    my $birth   = $yamlobj->[0]{'date of birth'}[0];
    my $hobbies = $yamlobj->[0]{hobbies};

    my $hobbiestring = join ", ", @$hobbies;

    my $contents = {
        name       => $name,
        occupation => $occ,
        birth      => $birth,
        hobbies    => $hobbiestring,
    };

    print "#RESULT:\n\n";
    print Dumper($contents);
}

sub convert {
    my ( $input, $output ) = @_;

    open my $infh,  '<', $input  or die "$!";
    open my $outfh, '>', $output or die "$!";

    while ( my $line = <$infh> ) {
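        # Turn each indented data line into a YAML list item. Leading tabs
        # are replaced with spaces because YAML forbids tab indentation.
        $line =~ s/^[ \t]+/  - /;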
        print $outfh ($line);
    }
}
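
To make the conversion step concrete, here is a small sketch that performs this kind of substitution on an in-memory string and prints the result. The helper name `to_yaml_text` is made up for illustration, and the two-space indent is an assumption, chosen because YAML does not allow tabs for indentation:

```perl
use strict;
use warnings;

# Hypothetical helper: turn each indented data line into a YAML list item.
sub to_yaml_text {
    my ($text) = @_;
    # /m lets ^ match at the start of every line; [ \t]+ avoids
    # swallowing newlines the way \s+ would on blank lines.
    ( my $yaml = $text ) =~ s/^[ \t]+/  - /mg;
    return $yaml;
}

my $record = "name:\n\tJohn Smith\nhobbies:\n\tBoating\n\tCamping\n";
print to_yaml_text($record);
# name:
#   - John Smith
# hobbies:
#   - Boating
#   - Camping
```

Every field becomes a list, even single-valued ones such as name, which is why the parse step reads the first element with `[0]`.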
