
I have a genome file of about 30 GB, similar to the excerpt below:

>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC

I am trying to parse the file quickly, reading it character by character with the code below, but the character is not getting printed:

open (FH, "<:raw", 'genome.txt') or die "can't open the file: $!\n";

until ( eof(FH) ) {
    $ch = getc(FH);
    print "$ch\n";    # not printing $ch
}
close FH;
  • Your question makes no sense to me. It depends on what you want to do with the data. Commented Jan 24, 2013 at 20:31
  • until(<FH>) is ... quite unusual. Commented Jan 24, 2013 at 20:39
  • I added incorrect code; I have just now corrected it. Commented Jan 24, 2013 at 20:45
  • It's really unclear what you're trying to accomplish. Commented Jan 24, 2013 at 20:46
  • You don't say what you want to do with the data, but reading it character by character is going to be very slow indeed and is almost certainly the wrong way to go about it. Commented Jan 24, 2013 at 20:52

1 Answer


Your mistake (in the original version of your code) was forgetting an eof: until (<FH>) reads and throws away a line every time the condition is checked, and the loop body is only entered once readline fails at end of file, so no characters from the file ever reach getc. Test for end-of-file instead:

until (eof FH) { ... }
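Applied to your loop, a minimal corrected sketch (assuming the rest of your program stays the same; it also uses lexical variables, as discussed below):

use strict;
use warnings;

open my $fh, '<:raw', 'genome.txt' or die "can't open the file: $!\n";
until (eof $fh) {
    my $ch = getc $fh;
    print "$ch\n";
}
close $fh;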

But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible, and as much work done inside the perl internals as we can get. This means that reading a file character by character is slow.

Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.

Either pick a natural record delimiter (like \n), or read a certain number of bytes:

local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
  # do something with the bytes
}

(see perlvar)
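With the default $/ = "\n", for example, the same loop hands you one line at a time. A sketch that skips the FASTA header lines (the per-line work is an assumption, since you haven't said what the task is):

open my $fh, '<', 'genome.txt' or die "can't open the file: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^>/;   # skip FASTA header lines like ">2RHet ..."
    # process the sequence characters in $line here
}
close $fh;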

You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.

# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.

sysopen FH, "genome.txt", O_RDONLY or die "can't sysopen genome.txt: $!\n";

my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
  print($char .= "\n");  # appending should be better than concatenation.
}

If we have gone that far, using Inline::C is just a small, and possibly preferable, further step.
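For illustration, a hedged sketch of that step (the scan_file helper is hypothetical, and this assumes Inline::C is installed and a C compiler is available):

use Inline C => <<'END_C';
/* walk the file byte by byte entirely in C */
long scan_file(char *path) {
    FILE *fp = fopen(path, "rb");
    long n = 0;
    int c;
    if (fp == NULL) return -1;
    while ((c = fgetc(fp)) != EOF) {
        n++;                 /* replace with the real per-byte work */
    }
    fclose(fp);
    return n;
}
END_C

print scan_file('genome.txt'), " bytes scanned\n";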


4 Comments

Make it 1024 or 4096 bytes at a time if necessary.
Make it 1 MB at a time; Perl can handle that. Google used 64 MB chunks in the old GFS.
I have benchmarked every approach to reading a 3.2 MB file with a sliding window of 200: #1. FH,"<:raw" took 179m51s; #2. sysread(FH, $ch, 1) took 90m36s; #3. the Tie::File module took 181m.
@made_in_india You should see a further performance increase with sysread FH, $ch, 1024, or by reading multiple characters in general. Your measurements seem awfully slow, which points to a problem we cannot see. Look at Borodin's comment under your question, and state your real problem.
