
I have a complex data structure (a user-defined type) on which a large number of independent calculations are performed. The data structure is basically immutable. I say basically because, although the interface looks immutable, internally some lazy evaluation is going on. Some of the lazily calculated attributes are stored in dictionaries (return values of costly functions, keyed by input parameter); a simplified sketch is below. I would like to use Python's multiprocessing module to parallelize these calculations. There are two questions on my mind.

  1. How do I best share the data-structure between processes?
  2. Is there a way to handle the lazy-evaluation problem without using locks (multiple processes writing the same value)?
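
To give an idea, the lazy caching looks roughly like this (a simplified sketch; the real class and the names are different):

class LazyData:
    """Looks immutable from the outside; caches costly results internally."""

    def __init__(self, samples):
        self._samples = samples
        self._cache = {}                          # input parameter -> costly result

    def costly_attribute(self, param):
        # lazy evaluation: compute on first request, then reuse
        if param not in self._cache:
            self._cache[param] = self._expensive_calculation(param)
        return self._cache[param]

    def _expensive_calculation(self, param):
        ...                                       # some costly function of the samples and param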

Thanks in advance for any answers, comments or enlightening questions!

2 Comments
  • How large / complex are you talking? When an independent calculation is submitted, do you know before the start which lazy attributes are needed? Commented Aug 10, 2010 at 10:24
  • The problem is basically a leave-one-out cross-validation on a large set of data-samples. It takes about two hours on my machine on a single core, but I have access to a machine with 24 cores and would like to leverage that power. I do not know in advance which of the attributes will be needed by a single calculation, but I know that eventually (over all calculations) all attributes will be needed, so I could just load them all up front (would have to test that though). Commented Aug 10, 2010 at 10:30

1 Answer


How do I best share the data-structure between processes?

Pipelines.

origin.py | process1.py | process2.py | process3.py

Break your program up so that each calculation is a separate process of the following form.

def transform1( piece ):
    # Some transformation or calculation on one piece; returns the derived data.
    ...

For testing, you can use it like this.

def t1( iterable ):
    for piece in iterable:
        more_data = transform1( piece )
        yield NewNamedTuple( piece, more_data )
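
NewNamedTuple here is just a placeholder for a record that carries the original piece plus the newly derived data; if you don't already have such a type, a minimal sketch with collections.namedtuple would be:

from collections import namedtuple

# placeholder record: the original piece plus whatever transform1 derived from it
NewNamedTuple = namedtuple( 'NewNamedTuple', [ 'piece', 'more_data' ] )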

For reproducing the whole calculation in a single process, you can do this.

for x in t1( t2( t3( the_whole_structure ) ) ):
    print( x )

You can wrap each transformation with a little bit of file I/O. Pickle works well for this, but other representations (like JSON or YAML) work well, too.

import pickle, sys
while True:
    try:
        a_piece = pickle.load( sys.stdin.buffer )
    except EOFError:
        break                                     # upstream process has finished
    more_data = transform1( a_piece )
    pickle.dump( NewNamedTuple( a_piece, more_data ), sys.stdout.buffer )

Each processing step becomes an independent OS-level process. They run concurrently and immediately consume all available OS-level resources.
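
For the source end of the pipeline, a minimal sketch of what origin.py might look like (the_whole_structure stands in for however the data is actually loaded):

import pickle
import sys

# origin.py: serialize each piece of the structure to stdout for the next stage
for piece in the_whole_structure:                 # assumed: an iterable of pieces
    pickle.dump( piece, sys.stdout.buffer )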

Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?

Pipelines.
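
Because each pipeline stage is a separate OS-level process with its own memory, the lazily computed values are private to that stage; no two processes ever write the same cache entry, so no locks are needed. If you still want memoization inside a stage, a minimal sketch (assuming the costly functions are pure; the names here are made up) uses functools.lru_cache:

from functools import lru_cache

@lru_cache( maxsize=None )
def costly_attribute( param ):
    # the expensive calculation goes here; it runs at most once per process,
    # and nothing is shared between processes, so no locking is needed
    ...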


3 Comments

Wow, that answer solves two problems that were not even in my question (how to send a complex object to another process, and how to do this in Python when the multiprocessing module is not available)!
The point is that OS-level (shared buffer) process management is (a) simpler and (b) can be as fast as more complex multi-threaded, shared-everything techniques.
@S.Lott I want to share numpy random state of a parent process with a child process. I've tried using Manager but still no luck. Could you please take a look at my question here and see if you can offer a solution? I can still get different random numbers if I do np.random.seed(None) every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated.
