
dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)

While this process is efficient in terms of network usage, it will still result in frequent pickling and unpickling of data. I assume this is not zero-copy pickling, like the memmapping that plasma or joblib offer for NumPy arrays.
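(To make clear what I mean by zero-copy reads, here is a rough sketch with joblib; the path and array are only illustrative:)

    import numpy as np
    import joblib

    arr = np.arange(10_000_000)
    joblib.dump(arr, "/tmp/arr.joblib")  # write the array to disk once

    # mmap_mode maps the file into memory instead of copying it on load,
    # so the loaded array is a read-only view backed by the file
    view = joblib.load("/tmp/arr.joblib", mmap_mode="r")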

It's clear that when a calculation result is needed on another host or another worker, it has to be pickled and sent over. This can be avoided by relying on threaded parallelism inside the same host, as all calculations access the same memory (--nthreads=8). dask uses this trick and simply accesses the result of calculations instead of pickling it.

When we're using process-based parallelization instead of threading on a single host (--nproc=8 --nthreads=1), I'd expect that dask pickles and unpickles again to send data through to the other worker. Is this correct?
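To make the comparison concrete, here is a minimal sketch of the two setups I have in mind, using dask.distributed's LocalCluster (the worker counts are just for illustration):

    from dask.distributed import Client, LocalCluster

    # Threaded workers: all tasks share one process and one address space,
    # so results can be reused without any serialization
    threaded = Client(LocalCluster(n_workers=1, threads_per_worker=8, processes=False))

    # Process-based workers: results moving between workers have to be
    # serialized, sent over a local socket, and deserialized again
    multiproc = Client(LocalCluster(n_workers=8, threads_per_worker=1, processes=True))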

  • Did you notice the costs of these steps? SER/DES pickling costs are awfully high for any ultra-low-latency processing. The threading tricks (if you compare them to process-based methods) ignore the cost of waiting for the monopolistic GIL lock (still, in 2H 2020) that re-[SERIAL]-ises any number of threads back into just a plain one-thread-(with GIL)-after-another-thread-(with GIL), while all the others keep waiting... Definitely not a sort of HPC performance, is it? Commented Jul 3, 2020 at 23:57
  • Sure, it's my aim to avoid pickling and allocating memory again when unpickling, but then horses for courses. dask is great at identifying common sub-trees of calculations and can save me a lot of calculation time through optimal execution; once data is sent over to another worker it should be quite fast. I'm comparing it to tools like ray, which for example passes results through plasma (and offers memmapping with zero-copy reads), but is probably not able to recognize common parts of larger calculation trees, so it will actually calculate more. Commented Jul 4, 2020 at 10:43
  • There is an open discussion about using shared memory if two workers are on the same host: github.com/dask/dask/issues/6267 Commented Jul 4, 2020 at 11:14

1 Answer


In case the same process needs the calculation result (continuing based on the temporary result), is it smart enough to keep a cache in the given process and reuse results there?

Yes. Dask keeps data in memory on workers until that data is no longer needed. https://distributed.dask.org/en/latest/memory.html
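For example (a minimal sketch; expensive_step and next_step stand in for your real functions), passing a future back into client.submit lets the scheduler reuse the result already sitting in worker memory instead of recomputing it or pulling it back to the client:

    from dask.distributed import Client

    def expensive_step(x):   # stand-in for a costly computation
        return x * 2

    def next_step(x):        # continues from the intermediate result
        return x + 1

    client = Client()                                # local cluster by default
    future = client.submit(expensive_step, 21)       # result stays in worker memory
    downstream = client.submit(next_step, future)    # scheduled near the data, no recompute
    print(downstream.result())                       # 43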

Edit: When Dask moves data between processes on the same host then yes, it serializes data and moves it across a local socket and then deserializes it. It doesn't necessarily use pickle to serialize. It depends on the type.
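As a rough way to see this (the exact header contents may vary between versions), you can ask Dask's protocol layer how it would serialize a given object:

    import numpy as np
    from distributed.protocol import serialize

    # For a NumPy array the header records Dask's own serializer and the
    # frames hold the raw buffer, rather than embedding it in a pickle stream
    header, frames = serialize(np.arange(1_000_000))
    print(header.get("serializer"), [len(f) for f in frames])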


1 Comment

Hi Matt, that's a good point! I rephrased my question to make it more readable...
