
I have a few thousand reports with consistently formatted tabular data embedded in them that I need to extract.

I have a few ideas, but thought I'd post to see if there's a better way than what I'm thinking, which is: extract the tabular data, write it to a new file, then parse that file as tabular data.

Here's a sample input and output, where the output is read and written row by row to a database.

INPUT_FILE

MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText
SubHeader
PASS    1283019238  alksdjalskdjl
FAIL    102310928301    kajdlkajsldkaj
PASS    102930192830    aoisdajsdoiaj
PASS    192830192301    jiasdojoasi
MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText

OUTPUT (read/write row-by-row from text-file to DB)

ROW-01{column01,column02,column03}
...
ROW-nth{column01,column02,column03}
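For the final row-by-row write to the database, a minimal sketch in Perl might look like the following. DBD::SQLite, the in-memory database, and the table name `results` are all assumptions for illustration; substitute your own driver and schema.

```perl
use strict;
use warnings;
use DBI;

# Hedged sketch of the row-by-row DB write. DBD::SQLite and the table
# name "results" are assumptions -- swap in your own driver/schema.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1 } );
$dbh->do("CREATE TABLE results (status TEXT, id TEXT, note TEXT)");

# Prepare once, execute once per extracted row (placeholders avoid quoting bugs).
my $sth = $dbh->prepare("INSERT INTO results VALUES (?, ?, ?)");
for my $line ( "PASS\t1283019238\talksdjalskdjl",
               "FAIL\t102310928301\tkajdlkajsldkaj" ) {
    $sth->execute( split /\t/, $line );    # one row per extracted line
}

my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM results");
print "$count rows stored\n";    # prints "2 rows stored"
```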

3 Answers


Recognizing when to start processing tabular data is easy: you've got the marker line. The difficulty is recognizing when to stop. One heuristic is to stop processing when the split no longer yields the expected number of columns.

use strict;
use warnings;

my $tab_data;   # true once we've seen the SubHeader marker
my $num_cols;   # column count of the first data row

while ( <> ) {
    $tab_data = 1, next if $_ eq "SubHeader\n";     # table starts here
    next unless $tab_data;                          # still in the preamble
    chomp;
    my @cols = split /\t/;
    $num_cols ||= scalar @cols;                     # remember the expected width
    last if $num_cols and $num_cols != scalar @cols;   # width changed: table is over
    print join( "\t", @cols ), "\n";
}

Save as etd.pl (etd = extract tabular data, what did you think?), and call it like this from the command line:

perl etd.pl < your-mixed-input.txt
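To see the stop heuristic in action without a separate input file, here's a self-contained variant that feeds the sample report in as a string (tab characters, written as `\t`, are assumed between the data columns):

```perl
use strict;
use warnings;

# Self-contained check of the stop heuristic: the sample report is fed
# in via an in-memory filehandle instead of STDIN.
my $report = join "\n",
    "MiscText MiscText MiscText",
    "SubHeader",
    "PASS\t1283019238\talksdjalskdjl",
    "FAIL\t102310928301\tkajdlkajsldkaj",
    "PASS\t102930192830\taoisdajsdoiaj",
    "PASS\t192830192301\tjiasdojoasi",
    "MiscText MiscText MiscText", "";

open my $fh, '<', \$report or die $!;
my ( $in_table, $num_cols, @rows );
while ( my $line = <$fh> ) {
    $in_table = 1, next if $line eq "SubHeader\n";
    next unless $in_table;
    chomp $line;
    my @cols = split /\t/, $line;
    $num_cols ||= scalar @cols;            # width of the first data row
    last if $num_cols != scalar @cols;     # width changed: table is over
    push @rows, \@cols;
}
close $fh;
print scalar(@rows), " rows extracted\n";  # prints "4 rows extracted"
```

The trailing `MiscText` line splits into a single column, which breaks the expected width of 3 and ends the loop after the four data rows.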

8 Comments

@Michael Ludwig: Thanks, looks great, though it appears I'm missing something. I've posted the code in the body of my question with $tab_data getting the sample data. When I run the code in the Perl debugger I always use (Ptkdb), Perl crashes/hangs at the while statement. Any idea what's going on, or what I'm missing? Again, thanks!
@blunders, the script expects the data streaming in on STDIN, which is standard input. Open up a command prompt and give it a try. - Ah, and please revert your edit to your original post - it is totally misleading, and not as intended at all. Thanks.
+2 @Michael Ludwig: Reverted the body of the question. I'll follow your edits, though I'm heading out of the office for a few hours; expect me to get back to this within 24 hours. Again, thank you!
@TLP - $tab_data will only be set to one when $_ eq "SubHeader\n". Do perl -MO=Deparse,-p etd.pl or perl -d to verify.
@TLP - You're right, it's only for one pass. And I agree the solution is not perfect. However, given the spec, it's certainly not inappropriate. And yes, it's a little bit obfuscating.

If you know how to extract data, why create a new file instead of processing it immediately?

1 Comment

+1 @zvrba: That's what I'd like to do, though I'd still have to figure out how. All the code I've used so far opens a file with open(FILEHANDLE), reads it in a while(<FILEHANDLE>) loop, then closes it with close(FILEHANDLE); I have no idea how to twist that into parsing text by linefeeds. As for extracting the data, I just know it's possible. I've updated the sample data to give you a better idea of what I mean: the sub-header is always the same, and the tabular data continues until the next line does not start with PASS or FAIL.
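The open/while/close pattern described above carries over unchanged; only the loop body needs state logic for "before the sub-header" versus "inside the table". Here's a sketch wrapped in a sub so the same logic runs against a real file or, as demonstrated below, an in-memory string. The PASS/FAIL test as the end-of-table condition follows the comment; the sample data here is abbreviated.

```perl
use strict;
use warnings;

# Sketch: extract data rows from a filehandle. The table starts after
# the "SubHeader" line and ends at the first line that does not begin
# with PASS or FAIL followed by a tab.
sub extract_rows {
    my ($fh) = @_;
    my ( $in_table, @rows );
    while ( my $line = <$fh> ) {
        if ( !$in_table ) {
            $in_table = 1 if $line eq "SubHeader\n";
            next;                                   # skip the preamble
        }
        last unless $line =~ /^(?:PASS|FAIL)\t/;    # table is over
        chomp $line;
        push @rows, [ split /\t/, $line ];
    }
    return @rows;
}

# Demo on an in-memory string; for a real report use:
#   open my $fh, '<', 'report.txt' or die $!;
my $sample = "MiscText\nSubHeader\nPASS\t1\ta\nFAIL\t2\tb\nMiscText\n";
open my $fh, '<', \$sample or die $!;
my @rows = extract_rows($fh);
close $fh;
print scalar(@rows), "\n";    # prints "2"
```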

If this is fixed-width data, I would strongly suggest using unpack or plain old substr.
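For example, a minimal unpack sketch. The template "A8 A14 A*" (an 8-character status field, a 14-character id field, then the rest of the line) is an assumed layout; adjust the widths to match the real reports.

```perl
use strict;
use warnings;

# Fixed-width parsing with unpack. The "A" codes take a fixed number of
# ASCII characters and strip trailing whitespace; "A*" takes the rest.
# The field widths here are assumptions, not the real report layout.
my $line = "PASS    1283019238    alksdjalskdjl";
my ( $status, $id, $note ) = unpack "A8 A14 A*", $line;
print "$status|$id|$note\n";    # prints "PASS|1283019238|alksdjalskdjl"
```

substr with explicit offsets does the same job; unpack just keeps the whole layout in one template string.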

