
I have a genome file of about 30 GB, similar to the excerpt below:

>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC

I am trying to parse the file quickly, reading it character by character with the code below, but the character is not getting printed:

open (FH, "<:raw", 'genome.txt') or die "can't open the file: $!\n";

until ( eof(FH) ) {
    $ch = getc(FH);
    print "$ch\n";    # not printing $ch
}
close FH;
  • Your question makes no sense to me. It depends on what you want to do with the data. Commented Jan 24, 2013 at 20:31
  • until(<FH>) is ... quite unusual. Commented Jan 24, 2013 at 20:39
  • I added incorrect code; I have just now corrected it. Commented Jan 24, 2013 at 20:45
  • It's really unclear what you're trying to accomplish. Commented Jan 24, 2013 at 20:46
  • You don't say what you want to do with the data, but reading it character by character is going to be very slow indeed and is almost certainly the wrong way to go about it. Commented Jan 24, 2013 at 20:52

1 Answer


Your mistake (in the original version of your code) was forgetting an eof: until (<FH>) reads and throws away a line every time the condition is checked, and the loop body is only entered once readline fails at end of file, so no characters from the file ever reach getc. Test for end-of-file instead:

until (eof FH) { ... }
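Applied to your loop, a minimal corrected sketch (assuming the rest of your program stays the same; it also uses lexical variables, as discussed below):

use strict;
use warnings;

open my $fh, '<:raw', 'genome.txt' or die "can't open the file: $!\n";
until (eof $fh) {
    my $ch = getc $fh;
    print "$ch\n";
}
close $fh;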

But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible, and as much work done inside the perl internals as we can get. This means that reading a file character by character is slow.

Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.

Either pick a natural record delimiter (like \n), or read a certain number of bytes:

local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
  # do something with the bytes
}

(see perlvar)
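With the default $/ = "\n", for example, the same loop hands you one line at a time. A sketch that skips the FASTA header lines (the per-line work is an assumption, since you haven't said what the task is):

open my $fh, '<', 'genome.txt' or die "can't open the file: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^>/;   # skip FASTA header lines like ">2RHet ..."
    # process the sequence characters in $line here
}
close $fh;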

You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.

# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.

sysopen FH, "genome.txt", O_RDONLY or die "can't sysopen genome.txt: $!\n";

my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
  print($char .= "\n");  # appending should be better than concatenation.
}

If we have gone that far, using Inline::C is just a small, and possibly preferable, further step.
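For illustration, a hedged sketch of that step (the scan_file helper is hypothetical, and this assumes Inline::C is installed and a C compiler is available):

use Inline C => <<'END_C';
/* walk the file byte by byte entirely in C */
long scan_file(char *path) {
    FILE *fp = fopen(path, "rb");
    long n = 0;
    int c;
    if (fp == NULL) return -1;
    while ((c = fgetc(fp)) != EOF) {
        n++;                 /* replace with the real per-byte work */
    }
    fclose(fp);
    return n;
}
END_C

print scan_file('genome.txt'), " bytes scanned\n";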


4 Comments

Make it 1024 or 4096 bytes at a time if necessary.
Make it 1 MB at a time; Perl can handle that. Google used 64 MB chunks in the old GFS.
I have benchmarked every approach to reading a 3.2 MB file with a sliding window of 200: #1. FH,"<:raw" took 179m51s; #2. sysread(FH, $ch, 1) took 90m36s; #3. the Tie::File module took 181m.
@made_in_india You should see a further performance increase with sysread FH, $ch, 1024, or by reading multiple characters in general. Your measurements seem awfully slow, which points to a problem we cannot see. Look at Borodin's comment under your question, and state your real problem.
