Parse fixed-width files

Question

I have a lot of text files with fixed-width fields:

<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street

The rest of the files are in a similar format, where the <c> will mark the beginning of a column, but they have various (unknown) column & space widths. What's the best way to parse these files?

I tried using Text::CSV, but since there's no delimiter it's hard to get a consistent result (unless I'm using the module wrong):

my $csv = Text::CSV->new();
$csv->sep_char (' ');

while (<FILE>){
    if ($csv->parse($_)) {
        my @columns=$csv->fields();
        print $columns[1] . "\n";
    }
}

Why do you object to the "parsing" tag? This is a parsing problem. That you require a solution in Perl does not mean it is not a parsing problem.
– zwol, Commented Feb 6, 2011 at 2:31
maybe I misunderstood...I thought putting "parsing" in there would bring a whole bunch of solutions that aren't relevant to my situation (i.e. python, php, etc)....thx
– user_78361084, Commented Feb 6, 2011 at 2:42
I am going to guess there is one (or two or three or...) module(s) on CPAN for this? As far as the dynamic widths, just build the appropriate "templates" up dynamically once the headers are read -- or does the width depend upon something absolutely insane like the max width of the data per column?
– user166390, Commented Feb 6, 2011 at 3:42
@pst - See my answer. CPAN has a module that not only parses, but can determine width automatically for you (heuristically) :)
– DVK, Commented Feb 6, 2011 at 12:53

Eric Strom · Accepted Answer · 2011-02-06 02:53:46Z

As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:

my @template = map {'A'.length}        # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;  # split first line into segments
$template[-1] = 'A*';                  # set the last segment to be slurpy

my $template = "@template";
print "template: $template\n";

my @data;
while (<DATA>) {
    push @data, [unpack $template, $_]
}

use Data::Dumper;

print Dumper \@data;

__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street

which prints:

template: A8 A10 A*
$VAR1 = [
          [
            'Dave',
            'Thomas',
            '123 Main'
          ],
          [
            'Dan',
            'Anderson',
            '456 Center'
          ],
          [
            'Wilma',
            'Rainbow',
            '789 Street'
          ]
        ];
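
The same idea can be wrapped in a pair of subs so it works on any filehandle rather than only on DATA. This is a minimal sketch, not part of the original answer: the names template_from_header and parse_fixed_width are hypothetical, and it assumes the first marker sits at offset 0 of the header line and that every run of non-space characters in the header starts a column. It measures widths from the start offsets of the markers (via the built-in @- array) instead of from the length of each matched segment.

```perl
use strict;
use warnings;

# Hypothetical helper: build an unpack template from a header line in which
# each run of non-space characters (e.g. "<c>") marks the start of a column.
# Assumes the first marker is at offset 0.
sub template_from_header {
    my ($header) = @_;
    chomp $header;

    # $-[0] holds the start offset of the most recent successful match
    my @starts;
    push @starts, $-[0] while $header =~ /\S+/g;

    # Each column's width is the distance to the next column's start
    my @widths = map { $starts[$_ + 1] - $starts[$_] } 0 .. $#starts - 1;

    # The last column has no successor, so let it take the rest of the line
    return join ' ', (map { "A$_" } @widths), 'A*';
}

# Hypothetical helper: read the header, then unpack every following record
sub parse_fixed_width {
    my ($fh) = @_;
    my $header = <$fh>;
    return [] unless defined $header;

    my $template = template_from_header($header);
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        push @rows, [ unpack $template, $line ];
    }
    return \@rows;
}

# Sample data copied from the question, read through an in-memory filehandle
my $sample = <<'END';
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $rows = parse_fixed_width($fh);
close $fh;

print join(' | ', @$_), "\n" for @$rows;
```

For a real file, open it with the three-argument open and pass the lexical filehandle to parse_fixed_width; the template is rebuilt from each file's own header, so the column widths may differ from file to file.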

DVK · Accepted Answer · 2011-02-06 12:53:06Z

CPAN to the rescue!

DataExtract::FixedWidth not only parses fixed-width files but, based on its POD, also appears to be smart enough to work out the column widths from the header line by itself!

2 Comments

DVK Over a year ago

BTW, the author hangs out here on SO once in a while.

Evan Carroll Over a year ago

DVK++ =) thanks! DE:FW is also well tested with tons of test input.

Peter Mortensen · Accepted Answer · 2014-05-02 16:05:20Z

Just use Perl's unpack function. Something like this:

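while (my $line = <FILE>) {
    # Split the record into three fields of 9, 25 and 50 characters
    my ($first, $last, $street) = unpack("A9A25A50", $line);

    # Do something with $first, $last and $street ...
}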

Inside the unpack template, the number after each A is the width of that field. unpack supports a variety of other format characters that you can mix into the same template, for example for binary integer fields (see perldoc -f pack for the full list). If the file really is fixed width, like mainframe files, then this should be the easiest approach.
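
To see what the A widths do, here is a small round trip, offered as a sketch rather than as part of the original answer. The widths 9, 25 and 50 come from the template above; the sample values are made up. It uses pack to build a space-padded record and unpack to cut it apart again.

```perl
use strict;
use warnings;

# Widths taken from the template in the answer; whitespace inside a
# template is ignored, so 'A9 A25 A50' means the same as 'A9A25A50'.
my $template = 'A9 A25 A50';

# pack with 'A' pads each value with spaces to the given width,
# producing one fixed-width record (sample values are made up).
my $record = pack $template, 'Dave', 'Thomas', '123 Main';

# unpack with 'A' cuts the record at the same widths and strips
# the trailing padding from each field.
my ($first, $last, $street) = unpack $template, $record;

print length($record), "\n";       # 84, i.e. 9 + 25 + 50
print "$first|$last|$street\n";    # Dave|Thomas|123 Main
```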

answered Feb 6, 2011 at 2:26 by user604939; edited May 2, 2014 at 16:05 by Peter Mortensen

1 Comment

user_78361084 Over a year ago

that's part of my question...the width of the field will change depending on the file that I feed it. Is there a way for unpack to detect the width from the header?
