
I have a Perl script which reads two files and processes them.

The first file - the info file - I store as a hash (3.5 GB).

The second file - the target file - I process using information from the info file and other subroutines as designed. (This target file ranges from 30 to 60 GB.)

So far working are:

  • reading the info file into a hash
  • breaking the target file into chunks

I want to run on all chunks in parallel:

while (chunks) {
    # do something with the current chunk:
    # call a() and b() on each of its lines
    a();
    b();
}

sub a { }
sub b { }

So basically, I want to read a chunk, write its output, and do this for multiple chunks at the same time. The while loop reads each line of a chunk file and calls various subroutines for processing.

Is there a way that I can read chunks in background?

I don't want to re-read the info file for every chunk: it is 3.5 GB, and reading it into a hash takes up that much memory every time.

Right now the script takes 1-2 hours to run on a 30-60 GB target file.

  • There is a lot of filtering and discarding, so the output for the entire target file is ~500 MB. Commented Sep 11, 2012 at 17:39
  • I would expect the OS to read the next chunk in the background without even being asked! Commented Sep 11, 2012 at 17:51
  • You said the file is 3.5 GB and the hash that holds it is 3.5 GB... I seriously doubt that. The hash is probably many GB more. Commented Sep 11, 2012 at 17:53
  • 1
    A good solution will totally depend on the structure of the input/output, which is not known. Commented Sep 11, 2012 at 18:27
  • A chunk here is a few lines of the main file. It does not get read in the background, because the script reads one chunk at a time. Commented Sep 11, 2012 at 19:42

3 Answers


You can try using Perl threads if the parallel tasks are independent.
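
If the chunks really are independent, a minimal sketch using the core threads module might look like the following; the chunk.* file names and the process_chunk() subroutine are placeholders standing in for the existing code, not names from the question.

use strict;
use warnings;
use threads;

# Placeholder: the target file has already been split into chunk.* files.
my @chunk_files = glob 'chunk.*';

# Start one worker thread per chunk.
my @workers = map { threads->create(\&process_chunk, $_) } @chunk_files;

# Wait for every worker and collect whatever it returns.
my @results = map { $_->join } @workers;

sub process_chunk {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        # call the existing subroutines on $line here
    }
    close $fh;
    return $file;    # or a summary of what was written
}

With many chunks you would normally cap the number of concurrent workers to roughly the number of CPU cores rather than starting them all at once.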


2 Comments

  • Processes (forking) will be better for this task.
  • Can you point me to a quick tutorial on processes (forking)? Thanks!
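
perldoc perlipc covers fork and wait; as a rough sketch of forking one child per chunk file (again, chunk.* and process_chunk() are placeholder names):

use strict;
use warnings;

my @chunk_files = glob 'chunk.*';    # placeholder file names

my @pids;
for my $file (@chunk_files) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                 # child: handle one chunk, then exit
        process_chunk($file);
        exit 0;
    }
    push @pids, $pid;                # parent: remember the child and go on
}

waitpid $_, 0 for @pids;             # wait for every child to finish

sub process_chunk {
    my ($file) = @_;
    # existing per-line processing and output writing goes here
}

CPAN's Parallel::ForkManager wraps this pattern and also lets you cap the number of simultaneous children.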

A 3.5 GB hash is very big; you should consider using a database instead. Depending on how you do it, you can keep accessing the database through the same hash syntax.
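
For example, assuming the DB_File module (the Berkeley DB bindings shipped with Perl) is available, a tied hash keeps the data on disk while the lookup code stays unchanged; info.db is a hypothetical file name.

use strict;
use warnings;
use Fcntl;
use DB_File;

# Tie %info to an on-disk Berkeley DB file instead of keeping it in RAM.
my %info;
tie %info, 'DB_File', 'info.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie info.db: $!";

# Populate once from the info file; later runs (and forked workers)
# can then read $info{$key} without loading 3.5 GB into memory.
$info{'example_key'} = 'example_value';
print $info{'example_key'}, "\n";

untie %info;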

If memory were a non-issue, forking would be the easiest solution. However, forking duplicates the process, including the hash, and with data this large that would only result in unnecessary swapping.

If you cannot free up some memory, you should consider using threads instead. Perl threads live only inside the interpreter and are invisible to the OS. They feel similar to forking; however, you can declare variables as :shared (you have to use threads::shared).

See the official Perl threading tutorial (perldoc perlthrtut).
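
A sketch of that approach, sharing the info hash between worker threads; the info.txt file, its tab-separated layout, and the chunk.* files are assumptions made for illustration only.

use strict;
use warnings;
use threads;
use threads::shared;

my %info : shared;

# Fill the shared hash once in the main thread. Plain string/number
# values can be stored directly; nested references would need
# threads::shared::shared_clone().
open my $info_fh, '<', 'info.txt' or die "Cannot open info.txt: $!";
while (my $line = <$info_fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $info{$key} = $value;
}
close $info_fh;

# Each worker reads %info without getting its own copy of the data.
my @workers = map { threads->create(\&process_chunk, $_) } glob 'chunk.*';
$_->join for @workers;

sub process_chunk {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        # look up $info{...} and apply the existing filters here
    }
    close $fh;
}

Note that storing millions of keys in a shared hash has noticeable overhead, so it is worth benchmarking this against the tied-database approach above.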

Comments


What about the File::Map module (memory mapping)? It makes reading big files easy.

use strict;
use File::Map qw(map_file);

map_file my $map, $ARGV[0]; # $ARGV[0] - path to your file
# Do something with $map
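
To feed the existing per-line subroutines, the mapped scalar can be scanned line by line with a /g match, which avoids copying the whole file into memory; target.txt is a hypothetical file name.

use strict;
use warnings;
use File::Map qw(map_file);

map_file my $map, 'target.txt';    # the OS pages the file in on demand

# Walk the mapped string one line at a time without slurping it.
while ($map =~ /([^\n]*\n|[^\n]+)/g) {
    my $line = $1;
    chomp $line;
    # pass $line to the existing processing subroutines here
}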

1 Comment

Thanks amon and fxzuz for your suggestions. I am looking into threads and File::Map. What I need is to thread the different outputs, and yes, the chunks are independent. I am not planning to go the database route, as I need the key information (and I'm not sure how that works either :) ).
