
I have a Perl script which reads two files and processes them.

The first file - the info file - I store as a hash (3.5 GB).

The second file - the target file - I process using information from the info file and other subroutines as designed. (This target file ranges from 30 to 60 GB.)

So far working are:

  • reading the info file into a hash
  • breaking the target file into chunks

I want to run on all chunks in parallel:

while (chunks) {
    # do something with the current chunk:
    # call a() and b() on each of its lines
    a();
    b();
}

sub a { }
sub b { }

So basically, I want to read a chunk, write its output, and do this for multiple chunks at the same time. The while loop reads each line of a chunk file and calls various subroutines for processing.

Is there a way that I can read chunks in background?

I don't want to re-read the info file for every chunk: it is 3.5 GB, and reading it into a hash takes up that much memory every time.

Right now the script takes 1-2 hours to run on a 30-60 GB target file.

  • There is a lot of filtering and discarding, so the output for the entire target file is ~500 MB. Commented Sep 11, 2012 at 17:39
  • I would expect the OS to read the next chunk in the background without even being asked! Commented Sep 11, 2012 at 17:51
  • You said the file is 3.5 GB and the hash that holds it is 3.5 GB... I seriously doubt that. The hash is probably many GB more. Commented Sep 11, 2012 at 17:53
  • 1
    A good solution will totally depend on the structure of the input/output, which is not known. Commented Sep 11, 2012 at 18:27
  • A chunk here is a few lines of the main file. It does not get read in the background, because the script reads one chunk at a time. Commented Sep 11, 2012 at 19:42

3 Answers


You can try using Perl threads if the parallel tasks are independent.
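
If the chunks really are independent, a minimal sketch using the core threads module might look like the following; the chunk.* file names and the process_chunk() subroutine are placeholders standing in for the existing code, not names from the question.

use strict;
use warnings;
use threads;

# Placeholder: the target file has already been split into chunk.* files.
my @chunk_files = glob 'chunk.*';

# Start one worker thread per chunk.
my @workers = map { threads->create(\&process_chunk, $_) } @chunk_files;

# Wait for every worker and collect whatever it returns.
my @results = map { $_->join } @workers;

sub process_chunk {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        # call the existing subroutines on $line here
    }
    close $fh;
    return $file;    # or a summary of what was written
}

With many chunks you would normally cap the number of concurrent workers to roughly the number of CPU cores rather than starting them all at once.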


2 Comments

  • Processes (forking) will be better for this task.
  • Can you point me to a quick tutorial on processes (forking)? Thanks!
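
perldoc perlipc covers fork and wait; as a rough sketch of forking one child per chunk file (again, chunk.* and process_chunk() are placeholder names):

use strict;
use warnings;

my @chunk_files = glob 'chunk.*';    # placeholder file names

my @pids;
for my $file (@chunk_files) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                 # child: handle one chunk, then exit
        process_chunk($file);
        exit 0;
    }
    push @pids, $pid;                # parent: remember the child and go on
}

waitpid $_, 0 for @pids;             # wait for every child to finish

sub process_chunk {
    my ($file) = @_;
    # existing per-line processing and output writing goes here
}

CPAN's Parallel::ForkManager wraps this pattern and also lets you cap the number of simultaneous children.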

A 3.5 GB hash is very big; you should consider using a database instead. Depending on how you do it, you can keep accessing the database through the same hash syntax.
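
For example, assuming the DB_File module (the Berkeley DB bindings shipped with Perl) is available, a tied hash keeps the data on disk while the lookup code stays unchanged; info.db is a hypothetical file name.

use strict;
use warnings;
use Fcntl;
use DB_File;

# Tie %info to an on-disk Berkeley DB file instead of keeping it in RAM.
my %info;
tie %info, 'DB_File', 'info.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie info.db: $!";

# Populate once from the info file; later runs (and forked workers)
# can then read $info{$key} without loading 3.5 GB into memory.
$info{'example_key'} = 'example_value';
print $info{'example_key'}, "\n";

untie %info;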

If memory were a non-issue, forking would be the easiest solution. However, forking duplicates the process, including the hash, and with data this large that would only result in unnecessary swapping.

If you cannot free up some memory, you should consider using threads instead. Perl threads live only inside the interpreter and are invisible to the OS. They feel similar to forking; however, you can declare variables as :shared (you have to use threads::shared).

See the official Perl threading tutorial (perldoc perlthrtut).
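
A sketch of that approach, sharing the info hash between worker threads; the info.txt file, its tab-separated layout, and the chunk.* files are assumptions made for illustration only.

use strict;
use warnings;
use threads;
use threads::shared;

my %info : shared;

# Fill the shared hash once in the main thread. Plain string/number
# values can be stored directly; nested references would need
# threads::shared::shared_clone().
open my $info_fh, '<', 'info.txt' or die "Cannot open info.txt: $!";
while (my $line = <$info_fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $info{$key} = $value;
}
close $info_fh;

# Each worker reads %info without getting its own copy of the data.
my @workers = map { threads->create(\&process_chunk, $_) } glob 'chunk.*';
$_->join for @workers;

sub process_chunk {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        # look up $info{...} and apply the existing filters here
    }
    close $fh;
}

Note that storing millions of keys in a shared hash has noticeable overhead, so it is worth benchmarking this against the tied-database approach above.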

Comments


What about the File::Map module (memory mapping)? It makes reading big files easy.

use strict;
use File::Map qw(map_file);

map_file my $map, $ARGV[0]; # $ARGV[0] - path to your file
# Do something with $map
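
To feed the existing per-line subroutines, the mapped scalar can be scanned line by line with a /g match, which avoids copying the whole file into memory; target.txt is a hypothetical file name.

use strict;
use warnings;
use File::Map qw(map_file);

map_file my $map, 'target.txt';    # the OS pages the file in on demand

# Walk the mapped string one line at a time without slurping it.
while ($map =~ /([^\n]*\n|[^\n]+)/g) {
    my $line = $1;
    chomp $line;
    # pass $line to the existing processing subroutines here
}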

1 Comment

Thanks amon and fxzuz for your suggestions. I am looking into threads and File::Map. What I need is to thread the different outputs, and yes, the chunks are independent. I am not planning to go the database route, as I need the key information (and I'm not sure how that works either :) ).
