What is the most efficient way to parse a text file using Perl?

Question

Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.

I have a .txt file that is about 2MB and about 16,000 lines long. Each record is 160 characters long, with a blocking factor of 10. This is an older type of data structure that looks almost like a tab-delimited file, but the fields are separated by single characters or whitespace.

First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";
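One detail worth checking here: in scalar context `glob` acts as an iterator, returning the next match on each call and `undef` once the matches run out. A minimal sketch, assuming a temporary directory as a stand-in for /some/cheese/dir (the directory and file name are made up for illustration), that takes the first match in list context instead:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for /some/cheese/dir, so the sketch runs anywhere.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/input.txt") or die "Could not create file: $!";
close($fh);

# List-context assignment takes the first match and discards the rest.
my ($txt_file) = glob("$dir/*.txt");
die "No .txt file found in $dir" unless defined $txt_file;

print "Found: $txt_file\n";
```

With only one file in the directory the cost of the glob itself is negligible, so it is unlikely to be the bottleneck.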

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");
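For comparison, a sketch of the same open using the three-argument form with a lexical filehandle, which also reports the operating system error via `$!`. The temporary file and its two records are made up so that the example is runnable on its own:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file so the sketch is runnable on its own.
my ($tmp_fh, $txt_file) = tempfile(UNLINK => 1);
print {$tmp_fh} "first record\nsecond record\n";
close($tmp_fh) or die "Could not close $txt_file: $!";

# Three-argument open with a lexical handle; $! carries the OS error text.
open(my $in, '<', $txt_file) or die "Could not open $txt_file: $!";

my $count = 0;
while (my $line = <$in>) {
    chomp $line;    # drop the newline before slicing fields
    $count++;
}
close($in);

print "Read $count lines\n";
```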

As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.

while ($line = <F>)
{
$nom_stat   = substr($line,0,1);
$lname      = substr($line,1,15);
$fname      = substr($line,16,15);
$mname      = substr($line,31,1);
$address    = substr($line,32,30);
$city       = substr($line,62,20);
$st         = substr($line,82,2);
$zip        = substr($line,84,5);
$lnum       = substr($line,93,9);
$cl_rank    = substr($line,108,4);
$ceeb       = substr($line,112,6);
$county     = substr($line,118,2);
$sex        = substr($line,120,1);
$grant_type = substr($line,121,1);
$int_major  = substr($line,122,3);
$acad_idx   = substr($line,125,3);
$gpa        = substr($line,128,5);
$hs_cl_size = substr($line,135,4);
}

This approach takes a lot of time to process each line and I'm wondering if there is a more efficient way of getting each field out of each line of the file.

Can anyone suggest a more efficient/preferred method?

See stackoverflow.com/questions/1083269/… for some relevant benchmarks.
– mob, Commented Mar 2, 2011 at 22:07
See stackoverflow.com/q/5083436#comment-5695536 for mob's list of dupes.
– daxim, Commented Mar 2, 2011 at 22:11

Joel Berger · Accepted Answer · 2011-03-02 22:04:16Z

8

It looks to me that you are working with fixed width fields here. Is that true? If it is, the unpack function is what you need. You provide the template for the fields and it will extract the info from those fields. There is a tutorial available, and the template information is found in the documentation for pack which is unpack's logical inverse. As a basic example simply:

my @values = unpack("A1 A15 A15 ...", $line);

where 'A' is a space-padded text field (unpack strips its trailing spaces, as I understand it) and the number is how many characters to read. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
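To make the template concrete, here is a sketch that spells it out for the offsets in the question. It assumes the substr() offsets there are correct, which means three unused column ranges (89-92, 102-107 and 133-134) have to be skipped with `x`. The sample record is made up, and the hash `%rec` is just one possible way to hold the result:

```perl
use strict;
use warnings;

# Template derived from the substr() offsets in the question.
# 'A<n>' reads n characters and strips trailing spaces; 'x<n>' skips n
# characters (the unused columns 89-92, 102-107 and 133-134).
my $template = 'A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4';

my @names = qw(nom_stat lname fname mname address city st zip lnum cl_rank
               ceeb county sex grant_type int_major acad_idx gpa hs_cl_size);

# Made-up sample record, built with pack() and padded to 160 characters.
my @sample = ('A', 'SMITH', 'JOHN', 'Q', '12 MAIN ST', 'SPRINGFIELD', 'PA',
              '19000', '123456789', '0042', '391234', '45', 'M', 'P',
              'BIO', '095', '3.750', '0310');
my $line = pack($template, @sample);
$line .= ' ' x (160 - length $line);

# A hash slice assigns all 18 fields in one step.
my %rec;
@rec{@names} = unpack($template, $line);

print "$rec{lname}, $rec{fname}: GPA $rec{gpa}\n";
```

One caveat: `x` dies if the line is shorter than the position it skips to, so truncated records would need a length check first.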

edited Mar 2, 2011 at 22:04

answered Mar 2, 2011 at 21:42

Joel Berger


5 Comments

Joel Berger Over a year ago

@daxim, thanks, I hope I used it correctly, I don't have much experience in writing templates for it.

CheeseConQueso Over a year ago

Thanks Joel. Even though the thread that mob suggested shows that substr is better, that might hold only in certain contexts. unpack is new to me, and it seems logical that it would be preferred because these are indeed fixed-length fields. I'll try it out, thanks.

socket puppet Over a year ago

substr is not faster than unpack. Did you read the whole benchmark post?

CheeseConQueso Over a year ago

Now I did. I'm running the unpack method now; we'll see how it goes. I don't know why this script is taking over 2 hours to finish. The dirty validation SQL at the end doesn't take more than a few seconds.

Ian C. Over a year ago

@CheeseConQueso: I included Benchmark, and an example of how to use it, in my answer so it'd be instructive. To understand why your program is taking a long time to run, Benchmark and Devel::DProf are invaluable tools in your Perl arsenal.
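Before profiling the whole script, it may help to time the parsing step in isolation. The sketch below does that with Time::HiRes on made-up in-memory data of roughly the size described in the question; the shortened template and the variable names are assumptions for illustration. If this loop finishes quickly, the parsing is probably not where the two hours go.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Made-up data: 16,000 identical 160-character records held in memory.
my @lines = ('0123456789' x 16) x 16_000;

my $start  = time;
my $parsed = 0;
for my $line (@lines) {
    # Shortened template: only the first eight fields, for brevity.
    my @fields = unpack('A1 A15 A15 A1 A30 A20 A2 A5', $line);
    $parsed++ if @fields == 8;
}
my $elapsed = time - $start;

printf "Parsed %d records in %.3f seconds\n", $parsed, $elapsed;
```

The same pair of `time` calls can be wrapped around other phases of the script (database inserts, validation) to see which one dominates.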

Ian C. · Accepted Answer · 2011-03-03 21:09:35Z

A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:

         Rate unpack substr regexp
 unpack 2.59/s     --   -59%   -67%
 substr 6.23/s   141%     --   -21%
 regexp 7.90/s   206%    27%     --

Input was a file with 20k lines, each line had the same 160 characters on it (16 repetitions of the characters 0123456789). So it's the same input size as the data you're working with.

The Benchmark::cmpthese() method lists the subroutines from slowest to fastest. The first column tells us how many times per second each subroutine can be run. The regular expression approach is fastest, not unpack as I stated previously. Sorry about that.

The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.

#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;

sub use_substr() {
    print "use_substr(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = (substr($line,0,1),
                           substr($line,1,15),
                           substr($line,16,15),
                           substr($line,31,1),
                           substr($line,32,30),
                           substr($line,62,20),
                           substr($line,82,2),
                           substr($line,84,5),
                           substr($line,93,9),
                           substr($line,108,4),
                           substr($line,112,6),
                           substr($line,118,2),
                           substr($line,120,1),
                           substr($line,121,1),
                           substr($line,122,3),
                           substr($line,125,3),
                           substr($line,128,5),
                           substr($line,135,4));
       #print "use_substr(): \$lname = $lname\n";
       #print "use_substr(): \$gpa   = $gpa\n";
    }    
    close(F);
    return 1;
}

sub use_regexp() {
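    print "use_regexp(): New iteration\n";
    # NOTE: these fields are contiguous, whereas use_substr() skips columns
    # 89-92, 102-107 and 133-134, so $lnum onward differ between the subs.
    my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';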
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        if ( $line =~ m/$pattern/o ) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
               $hs_cl_size) = ( $1,
                                $2,
                                $3,
                                $4,
                                $5,
                                $6,
                                $7,
                                $8,
                                $9,
                                $10,
                                $11,
                                $12,
                                $13,
                                $14,
                                $15,
                                $16,
                                $17,
                                $18);
            #print "use_regexp(): \$lname = $lname\n";
            #print "use_regexp(): \$gpa   = $gpa\n";
        }
    }    
    close(F);
    return 1;
}

sub use_unpack() {
    print "use_unpack(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = unpack(
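               # NOTE: contiguous fields, as in use_regexp(); the columns that
               # use_substr() skips are not skipped here.
               "(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line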
               );
        #print "use_unpack(): \$lname = $lname\n";
        #print "use_unpack(): \$gpa   = $gpa\n";
    }
    close(F);
    return 1;
}

# Benchmark it
my $itt = 50;
cmpthese($itt, {
        'substr' => sub { use_substr(); },
        'regexp' => sub { use_regexp(); },
        'unpack' => sub { use_unpack(); },
    }
);
exit(0)
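One caveat about the benchmark above: the three subs do not extract the same columns. use_substr() skips the unused columns, while the regexp and the unpack template treat the 18 fields as contiguous. The sketch below assumes the substr() offsets from the question are the authoritative layout and builds both a regexp and an unpack template from them, then checks that all three approaches agree on one line. The `@layout` table and the variable names are illustrative, not part of the original benchmark; `qr//` is used here in place of `/o` to compile the pattern once.

```perl
use strict;
use warnings;

# Offset/length pairs copied from the substr() calls in the question.
my @layout = ([0,1],[1,15],[16,15],[31,1],[32,30],[62,20],[82,2],[84,5],
              [93,9],[108,4],[112,6],[118,2],[120,1],[121,1],[122,3],
              [125,3],[128,5],[135,4]);

# Build the regexp and unpack template from the layout, skipping the gaps.
my ($re_src, $template, $pos) = ('^', '', 0);
for my $f (@layout) {
    my ($off, $len) = @$f;
    my $gap = $off - $pos;
    if ($gap > 0) {
        $re_src   .= ".{$gap}";     # uncaptured filler
        $template .= "x$gap ";
    }
    $re_src   .= "(.{$len})";
    $template .= "a$len ";          # 'a' keeps trailing spaces, like substr
    $pos = $off + $len;
}
my $re = qr/$re_src/;               # compiled once, reused for every line

# Same test data as the benchmark: 16 repetitions of 0123456789.
my $line = '0123456789' x 16;

my @by_substr = map { substr($line, $_->[0], $_->[1]) } @layout;
my @by_regexp = $line =~ $re;
my @by_unpack = unpack($template, $line);

print "substr: @by_substr[8, 17]\n";
print "regexp: @by_regexp[8, 17]\n";
print "unpack: @by_unpack[8, 17]\n";
```

Re-running the benchmark with templates like these would give a like-for-like comparison; the timings in the table above were measured with the original, contiguous templates.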

Thanks Ian, this is a good and comprehensive explanation. I'll try them, but will probably go with unpack since you did all the hard work already.
I tried out your unpack method and it makes each iteration of the loop considerably slower. I'm not sure why.
Never mind, I found out. It still doesn't make it that much faster, but it definitely helps.
Wow. I'm dense. I just noticed the first column in the cmpthese() output has a /s in it, not s. That's a big deal. Updated my answer.
The regex wasn't parsing the lines properly. I'll post some sample input when I'm back at work.

Geo · Accepted Answer · 2011-03-02 21:01:16Z

0

Do a split on each line, like this:

my @values = split(/\s/,$line);

and then work with your values.
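Whether this works depends on the data. With fixed-width records, any field that contains a space (a two-word surname, a street address) gets broken apart. A small sketch with a made-up record fragment, comparing split with unpack; it uses `/\s+/` rather than `/\s/` so that runs of padding do not produce empty strings:

```perl
use strict;
use warnings;

# Made-up fixed-width fragment: 15-char last name, 15-char first name,
# 30-char address. The last name and address both contain spaces.
my $line = sprintf('%-15s%-15s%-30s', 'VAN DYKE', 'MARY', '12 MAIN ST');

my @by_split  = split(/\s+/, $line);
my @by_unpack = unpack('A15 A15 A30', $line);

print scalar(@by_split),  " pieces from split:  @by_split\n";
print scalar(@by_unpack), " fields from unpack: @by_unpack\n";
```

split returns more pieces than there are fields, and nothing in the result says which pieces belong together.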

answered Mar 2, 2011 at 21:01

Geo


2 Comments

Joel Berger Over a year ago

Further, it will only work if the data is space-separated. Since the OP takes one character from position 0, then 15 characters from position 1, then 15 characters from position 16, it doesn't appear to be, unless my math is incorrect.

CheeseConQueso Over a year ago

i thought about a split, but thought that it was only used for fixed length breaks or characters as its discriminator. i think the breaks between fields are too varied for split to work unless its wrapped in some other logical test

markijbema · Accepted Answer · 2011-03-02 21:37:47Z

0

You could do something like:

while ($line = <F>){
   if ($line =~ /(.{1}) (.{15}) ........ /){
     $nom_stat = $1;
     $lname = $2;
     ...
   }
}

I think it's faster than your substr approach. I'm not sure it's the fastest solution, but it might very well be.
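As a variation on the loop body above, a match in list context returns its captures directly, which avoids copying $1, $2 and so on one at a time. This is a sketch for the first three fields only, assuming they sit back to back with no separator characters (as the substr() offsets in the question suggest); the sample record is made up:

```perl
use strict;
use warnings;

# Made-up record fragment: 1-char status, 15-char last name, 15-char first name.
my $line = sprintf('%-1s%-15s%-15s', 'A', 'SMITH', 'JOHN');

# In list context a successful match returns the captures directly.
my ($nom_stat, $lname, $fname) = $line =~ /^(.)(.{15})(.{15})/
    or die "Line did not match the expected layout";

# The captures keep their space padding, so trim it explicitly.
s/\s+$// for $lname, $fname;

print "[$nom_stat] [$lname] [$fname]\n";
```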

answered Mar 2, 2011 at 21:37

markijbema


3 Comments

CheeseConQueso Over a year ago

this looks cryptic to me - not used to that syntax. what is this attempt doing in english?

RET Over a year ago

It's a regex, dot is any one character, and the number in braces is an occurrence count. In other words: 1 character, space, 15 characters [etc]. I still wouldn't do it this way though - use unpack().

markijbema Over a year ago

Wow, I did not expect it to be as slow as Ian C showed. I'm rather surprised really, I thought it would've at least been faster than substr...
