
I am working with a big dataset. My input is 4 different datasets, and I have to apply a particular function to each one. So what I have done is read all four datasets and apply the function to each of them in parallel using pool.map. Now I have a parent and 4 child processes. Everything is fine up to this point.

Q1. What happens inside each process? In the function I am applying to each dataset, I compare each tuple with every other tuple, so it's a kind of recursion. Is there a way to make this parallel too? The comparison may take a long time since the dataset will be big. Can I parallelize again inside a child process? I have more processors available, so I want to utilize them.

Q2. What I have in mind for parallelizing this recursive task is this: if I am comparing tuple x with tuple y (every tuple with every other tuple), I can split x into chunks and have each chunk do the comparison with y. I guess this can be done with two 'for' loops. Any suggestions on how to do this?

1 Answer


Re: Q1, if you're creating your child processes using a multiprocessing.Pool, then no, the worker processes cannot have children of their own. Attempting to create one will raise an exception:

AssertionError: daemonic processes are not allowed to have children

The reason is stated right in the message: the processes in a Pool are daemonic, and daemonic processes can't have children. Terminating the parent process terminates its daemonic children, but those daemonic children would have no chance to terminate children of their own, which would leave orphaned processes behind. This is stated in the documentation:

Note that a daemonic process is not allowed to create child processes. Otherwise a daemonic process would leave its children orphaned if it gets terminated when its parent process exits.
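To see the restriction in action, here's a minimal sketch (not from your code) in which a Pool worker tries to start a child and catches the resulting AssertionError, whose message matches the one quoted above:

```python
import multiprocessing

def worker(_):
    # Inside a Pool worker, which is daemonic, starting a child fails
    try:
        child = multiprocessing.Process(target=print, args=("hello",))
        child.start()
        return "no error"
    except AssertionError as exc:
        return str(exc)

if __name__ == "__main__":
    with multiprocessing.Pool(1) as pool:
        print(pool.map(worker, [0])[0])
        # prints: daemonic processes are not allowed to have children
```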

You can get around this by having the parent process create a set of non-daemonic Process objects, rather than using a Pool. Then each child can create its own multiprocessing.Pool:

import multiprocessing

def subf(x):
    print("in subf")

def f(x):
    print("in f")
    # A non-daemonic child process is allowed to create its own Pool
    p = multiprocessing.Pool(2)
    p.map(subf, range(2))
    p.close()
    p.join()


if __name__ == "__main__":
    processes = []
    for i in range(2):
        proc = multiprocessing.Process(target=f, args=(i,))
        proc.start()
        processes.append(proc)
    for proc in processes:
        proc.join()

Output:

in f
in f
in subf
in subf
in subf
in subf

This approach should work well for you, since your initial input is only four datasets. You can create one Process per dataset and still have some free CPUs to spare for each sub-process to use in a small Pool.
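As a sketch of how you might size those per-child pools (the division of cores here is an assumption to illustrate the idea, and the per-dataset work is a stand-in):

```python
import multiprocessing

def process_dataset(tag):
    # Assumption: four top-level children share the available cores evenly
    workers = max(1, multiprocessing.cpu_count() // 4)
    with multiprocessing.Pool(workers) as pool:
        # Stand-in for the real per-dataset work
        return pool.map(abs, range(-2, 1))

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=process_dataset, args=(t,))
             for t in ("a", "b", "c", "d")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```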

Re: Q2, it sounds like you could use itertools.product to create one large iterable of each pair of tuples you want to compare. You can then use pool.map to parallelize comparing each pair. Here's an example showing how that works:

import itertools
import multiprocessing

def f(x):
    print(x)

if __name__ == "__main__":
    # Create two lists of tuples, like your use-case
    x = zip(range(3), range(3, 6))
    y = zip(range(6, 9), range(9, 12))

    pool = multiprocessing.Pool()
    pool.map(f, itertools.product(x, y))

Output:

((0, 3), (6, 9))
((0, 3), (7, 10))
((0, 3), (8, 11))
((1, 4), (6, 9))
((1, 4), (7, 10))
((1, 4), (8, 11))
((2, 5), (6, 9))
((2, 5), (8, 11))
((2, 5), (7, 10))
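If your comparison returns a value, pool.map will collect the results in the same order as the input pairs, even though the prints above are interleaved. Here's a sketch with a made-up comparison function (counting the positions where the two tuples differ), since your actual comparison isn't shown:

```python
import itertools
import multiprocessing

def compare(pair):
    # Hypothetical comparison: count positions where the two tuples differ
    a, b = pair
    return sum(1 for i, j in zip(a, b) if i != j)

if __name__ == "__main__":
    x = list(zip(range(3), range(3, 6)))
    y = list(zip(range(6, 9), range(9, 12)))
    with multiprocessing.Pool() as pool:
        results = pool.map(compare, itertools.product(x, y))
    print(results)  # every pair differs in both positions: nine 2s
```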

3 Comments

Thanks for the explanation. For the second question, I am already using data1 = itertools.combinations(values, 2) and new = (([i] for i, t in enumerate(zip(*pair)) if t[0] != t[1]) for pair in data1), where values has all my tuples and data1 has all pairs. So you mean that if I use pool.map(f, new, chunksize=4), I can do that process in parallel using four processes, is that right?
Using chunksize=4 will just send four elements of new at a time to each process in the pool. It won't split the list into four equal chunks. You can parallelize the comparisons using pool.map, but I suspect that the overhead of sending the chunks between processes may be more expensive than the cost of doing the comparisons themselves. So there's a good chance parallelizing this will actually end up being slower than doing it sequentially.
Cool, thanks for that. It would be great if you could help me with one more question; I have been stuck on it for a long time :( - stackoverflow.com/questions/25949506/…
