
I have a function, say fun(), which returns a generator. I want to check whether the generator is empty, and since I want to save as much run time as possible, I don't convert it to a list to check whether it's empty. Instead I do this:

import itertools

def peek(iterable):
    try:
        first = next(iterable)
    except StopIteration:
        return None
    return first, itertools.chain([first], iterable)
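For illustration, here's how peek behaves on a couple of throwaway generators (not my real data):

assert peek(iter([])) is None                # empty iterator -> None
first, rest = peek(x * x for x in range(3))  # non-empty -> first item plus a rebuilt iterator
assert first == 0
assert list(rest) == [0, 1, 4]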

I use multiprocessing like this:

def call_generator_obj(ret_val):
    peeked = peek(ret_val)  # avoid shadowing the built-in next()
    if peeked is not None:
        return False
    else:
        return True


def main():
    import multiprocessing as mp
    pool = mp.Pool(processes=mp.cpu_count() - 1)
    results = []
    # for loop over here
        ret_val = fun(args, kwargs)
        results.append(pool.apply(call_generator_obj, args=(ret_val,)))
        # the above line throws the error

As far as I know, pickling is converting some object in memory to a byte stream, and I don't think I am doing anything like that in any of my functions.

Traceback (from the line marked above):

  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 253, in apply
    return self.apply_async(func, args, kwds).get()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 385, in _handle_tasks
    put(task)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: can't pickle generator objects

1 Answer


As far as I know, pickling is converting some object in memory to a byte stream, and I don't think I am doing anything like that over here.

Well, you are doing that.

You can't pass Python values directly between processes. Even the simplest variable holds a pointer to a structure somewhere in the process's memory space, and just copying that pointer to a different process would give you a segfault or garbage, depending on whether the same memory space in the other process is unmapped or mapped to something completely different. Something as complex as a generator—which is basically a live stack frame—would be even more impossible.

The way multiprocessing solves that is to transparently pickle everything you pass to it: functions, their arguments, and their return values all need to be pickled.

If you want to know how it works under the covers: a Pool essentially works by having a Queue that the parent puts tasks on—where the tasks are basically (func, args) pairs—and the children take tasks off. And a Queue essentially works by calling pickle.dumps(value) and then writing the result to a pipe or other inter-process-communication mechanism.
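You can reproduce the failure without a Pool at all. Here's a minimal sketch of roughly what the task queue ends up doing with your ret_val (the generator function below is just a stand-in, not your real fun):

import pickle

def some_generator():
    yield 1  # stand-in for the real generator

try:
    pickle.dumps(some_generator())  # roughly what the Pool's task queue does with ret_val
except TypeError as e:
    print(e)  # can't pickle generator objects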


8 Comments

Hmmm... any solution/workaround then, without having to convert my generator into a list? It's a matter of saving run time and I am dealing with very large data ☹️
@J.Doe There is no fully-general solution. For most apps, there are specific solutions, where you can rewrite your iteration so that it's possible to grab a value or chunk of values in the parent and pass that to the pool function, instead of passing the iterator. But for your code as posted here, the only visible work is calling next on a generator. If that's true for your real code as well, it would probably require much larger changes.
What do you mean by "the only visible work is calling next on a generator"? You mean that is the thing creating the problem, right? And yes, that is a top-level summary of my code. I guess I will have to run a profiler and trade off between generators and multiprocessing.
@J.Doe Well, yeah, but the point is that there's presumably real work being done inside the generator functions, and if you could surface that real work—e.g., maybe yield a bunch of values that can be passed to a function, instead of yielding the result of calling that function—then you could multiprocess that work without rewriting everything from scratch (a sketch of this idea appears after these comments). Of course that would almost certainly mean breaking an abstraction that makes your code readable/maintainable/generic, but sometimes you have to do that for the sake of performance.
Sorry man, it's open source and the guys will go crazy if I try to do that 😛
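For reference, a minimal sketch of the restructuring described in the comments above, assuming the generator's real work can be expressed as a per-item function (all names here are hypothetical):

import multiprocessing as mp

def process(item):
    # Hypothetical stand-in for the real per-item work the generator used to do itself.
    return item * item

def produce_items(n):
    # Hypothetical stand-in: yield the raw inputs instead of the processed results.
    for item in range(n):
        yield item

def main():
    with mp.Pool(processes=mp.cpu_count() - 1) as pool:
        # Only plain, picklable items cross the process boundary;
        # the generator itself never leaves the parent process.
        results = pool.map(process, produce_items(10))
    print(results)

if __name__ == '__main__':
    main()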