
I have a lot of text files with fixed-width fields:

<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street

The rest of the files are in a similar format, where the <c> will mark the beginning of a column, but they have various (unknown) column & space widths. What's the best way to parse these files?

I tried using Text::CSV, but since there's no delimiter it's hard to get a consistent result (unless I'm using the module wrong):

my $csv = Text::CSV->new();
$csv->sep_char (' ');

while (<FILE>){
    if ($csv->parse($_)) {
        my @columns=$csv->fields();
        print $columns[1] . "\n";
    }
}

  • Why do you object to the "parsing" tag? This is a parsing problem. That you require a solution in Perl does not mean it is not a parsing problem. Commented Feb 6, 2011 at 2:31
  • because I don't want a general solution Commented Feb 6, 2011 at 2:33
  • maybe I misunderstood...I thought putting "parsing" in there would bring a whole bunch of solutions that aren't relevant to my situation (ie python, php, etc)....thx Commented Feb 6, 2011 at 2:42
  • I am going to guess there is one (or two or three or...) module(s) on CPAN for this? As far as the dynamic widths, just build the appropriate "templates" up dynamically once the headers are read -- or does the width depend upon something absolutely insane like the max width of the data per column? Commented Feb 6, 2011 at 3:42
  • @pst - See my answer. CPAN has a module that not only parses, but can determine width automatically for you (heuristically) :) Commented Feb 6, 2011 at 12:53

3 Answers


As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:

my @template = map {'A'.length}        # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;  # split first line into segments
$template[-1] = 'A*';                  # set the last segment to be slurpy

my $template = "@template";
print "template: $template\n";

my @data;
while (<DATA>) {
    push @data, [unpack $template, $_]
}

use Data::Dumper;

print Dumper \@data;

__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street

which prints:

template: A8 A10 A*
$VAR1 = [
          [
            'Dave',
            'Thomas',
            '123 Main'
          ],
          [
            'Dan',
            'Anderson',
            '456 Center'
          ],
          [
            'Wilma',
            'Rainbow',
            '789 Street'
          ]
        ];
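
Since the question mentions many files with different layouts, the same idea can be wrapped in a small helper and called once per file. This is only a minimal sketch: the name template_from_header is hypothetical, and it assumes every column marker in the header is a run of non-space characters followed by padding spaces, as in the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a header line such as "<c>     <c>       <c>"
# into an unpack template such as "A8 A10 A*".
sub template_from_header {
    my ($header) = @_;
    chomp $header;

    # Each segment is one marker plus the spaces that follow it;
    # its length is the width of that column.
    my @widths = map { length($_) } $header =~ /(\S+\s*)/g;
    die "no columns found in header\n" unless @widths;

    my @parts = map { "A$_" } @widths;
    $parts[-1] = 'A*';    # the last column takes the rest of the line
    return join ' ', @parts;
}

my $template = template_from_header("<c>     <c>       <c>\n");
my @fields   = unpack $template, "Wilma   Rainbow   789 Street";

print "$template\n";
print join('|', @fields), "\n";
```

For real files, the header would come from the first line read from each file handle rather than from a literal string.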


CPAN to the rescue!

DataExtract::FixedWidth not only parses fixed-width files but, based on its POD, also appears to be smart enough to work out the column widths from the header line by itself!

2 Comments

BTW, the author hangs out here on SO once in a while.
DVK++ =) thanks! DE:FW is also well tested with tons of test input.

Just use Perl's unpack function. Something like this:

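while (<FILE>) {
    # The widths 9, 25, and 50 are examples; adjust them to match your file.
    my ($first, $last, $street) = unpack("A9A25A50", $_);

    # ... do something with $first, $last, and $street ...
}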

In the unpack template, each "A" is followed by the width of its field, so "A9A25A50" reads three space-padded text fields of 9, 25, and 50 characters. Other format codes can be mixed in for other kinds of fields, such as integers. If the file is truly fixed width, like mainframe files, this should be the easiest approach.
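
As a small illustration of how the format code affects the result, the sketch below unpacks one record from the question twice: once with "A", which strips trailing spaces, and once with "a", which keeps the padding. The widths 8 and 10 are assumptions taken from the sample data in the question, not values that will suit every file.

```perl
use strict;
use warnings;

# Sample record laid out as in the question: 8 + 10 columns, then the rest.
my $line = "Dave    Thomas    123 Main";

# 'A' treats each field as space-padded text and strips trailing spaces.
my @trimmed = unpack 'A8 A10 A*', $line;

# 'a' returns each field exactly as it appears, padding included.
my @raw = unpack 'a8 a10 a*', $line;

print join('|', @trimmed), "\n";    # Dave|Thomas|123 Main
print join('|', @raw), "\n";        # Dave    |Thomas    |123 Main
```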

1 Comment

that's part of my question...the width of the field will change depending on the file that I feed it. Is there a way for unpack to detect the width from the header?
