
I am working with a big dataset. My input is 4 different datasets, and I have to apply a particular function to each one. So what I have done is read all four datasets and apply the function to each of them in parallel using pool.map. Now I have a parent and 4 child processes. Everything is fine up to this point.

Q1. What happens inside each process? In the function I am applying to each dataset, I compare each tuple with every other tuple, so it's a kind of recursion. Is there a way to make this parallel too? The comparison may take a long time since the dataset will be big. Can I parallelize again inside a child process? I have more processors available, so I want to utilize them.

Q2. What I have in mind for parallelizing this recursive task is this: if I am comparing tuple x with tuple y (every tuple with every other tuple), I can split x into chunks and have each chunk do the comparison with y. I guess this can be done with two 'for' loops. Any suggestions on how to do this?

1 Answer


Re: Q1, if you're creating your child processes using a multiprocessing.Pool, then no, the worker processes cannot have children of their own. Attempting to create one will raise an exception:

AssertionError: daemonic processes are not allowed to have children

The reason is stated right in the message: the processes in a Pool are daemonic, and daemonic processes can't have children. Terminating the parent process terminates its daemonic children, but those daemonic children would have no chance to terminate children of their own, which would leave orphaned processes behind. This is stated in the documentation:

Note that a daemonic process is not allowed to create child processes. Otherwise a daemonic process would leave its children orphaned if it gets terminated when its parent process exits.
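To see the restriction in action, here's a minimal sketch (not from your code) in which a Pool worker tries to start a child and catches the resulting AssertionError, whose message matches the one quoted above:

```python
import multiprocessing

def worker(_):
    # Inside a Pool worker, which is daemonic, starting a child fails
    try:
        child = multiprocessing.Process(target=print, args=("hello",))
        child.start()
        return "no error"
    except AssertionError as exc:
        return str(exc)

if __name__ == "__main__":
    with multiprocessing.Pool(1) as pool:
        print(pool.map(worker, [0])[0])
        # prints: daemonic processes are not allowed to have children
```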

You can get around this by having the parent process create a set of non-daemonic Process objects, rather than using a Pool. Then each child can create its own multiprocessing.Pool:

import multiprocessing

def subf(x):
    print("in subf")

def f(x):
    print("in f")
    # A non-daemonic child process is allowed to create its own Pool
    p = multiprocessing.Pool(2)
    p.map(subf, range(2))
    p.close()
    p.join()


if __name__ == "__main__":
    processes = []
    for i in range(2):
        proc = multiprocessing.Process(target=f, args=(i,))
        proc.start()
        processes.append(proc)
    for proc in processes:
        proc.join()

Output:

in f
in f
in subf
in subf
in subf
in subf

This approach should work well for you, since your initial input is only four datasets. You can create one Process per dataset and still have some free CPUs to spare for each sub-process to use in a small Pool.
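As a sketch of how you might size those per-child pools (the division of cores here is an assumption to illustrate the idea, and the per-dataset work is a stand-in):

```python
import multiprocessing

def process_dataset(tag):
    # Assumption: four top-level children share the available cores evenly
    workers = max(1, multiprocessing.cpu_count() // 4)
    with multiprocessing.Pool(workers) as pool:
        # Stand-in for the real per-dataset work
        return pool.map(abs, range(-2, 1))

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=process_dataset, args=(t,))
             for t in ("a", "b", "c", "d")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```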

Re: Q2, it sounds like you could use itertools.product to create one large iterable of each pair of tuples you want to compare. You can then use pool.map to parallelize comparing each pair. Here's an example showing how that works:

import itertools
import multiprocessing

def f(x):
    print(x)

if __name__ == "__main__":
    # Create two lists of tuples, like your use-case
    x = zip(range(3), range(3, 6))
    y = zip(range(6, 9), range(9, 12))

    pool = multiprocessing.Pool()
    pool.map(f, itertools.product(x, y))

Output:

((0, 3), (6, 9))
((0, 3), (7, 10))
((0, 3), (8, 11))
((1, 4), (6, 9))
((1, 4), (7, 10))
((1, 4), (8, 11))
((2, 5), (6, 9))
((2, 5), (8, 11))
((2, 5), (7, 10))
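If your comparison returns a value, pool.map will collect the results in the same order as the input pairs, even though the prints above are interleaved. Here's a sketch with a made-up comparison function (counting the positions where the two tuples differ), since your actual comparison isn't shown:

```python
import itertools
import multiprocessing

def compare(pair):
    # Hypothetical comparison: count positions where the two tuples differ
    a, b = pair
    return sum(1 for i, j in zip(a, b) if i != j)

if __name__ == "__main__":
    x = list(zip(range(3), range(3, 6)))
    y = list(zip(range(6, 9), range(9, 12)))
    with multiprocessing.Pool() as pool:
        results = pool.map(compare, itertools.product(x, y))
    print(results)  # every pair differs in both positions: nine 2s
```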

3 Comments

Thanks for the explanation. For the second question, I am already using data1 = itertools.combinations(values, 2) and new = (([i] for i, t in enumerate(zip(*pair)) if t[0] != t[1]) for pair in data1), where values has all my tuples and data1 has all pairs. So you mean that if I use pool.map(f, new, chunksize=4), I can do that process in parallel using four processes, is that right?
Using chunksize=4 will just send four elements of new at a time to each process in the pool. It won't split the list into four equal chunks. You can parallelize the comparisons using pool.map, but I suspect that the overhead of sending the chunks between processes may be more expensive than the cost of doing the comparisons themselves. So there's a good chance parallelizing this will actually end up being slower than doing it sequentially.
Cool, thanks for that. It would be great if you could help me with one more question; I have been stuck on it for a long time :( - stackoverflow.com/questions/25949506/…
