dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)
While this approach is efficient in terms of network usage, it still results in frequent pickling and unpickling of data. I assume this is not zero-copy pickling, unlike the memmapping that plasma or joblib can do for numpy arrays.
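For what it's worth, standard pickling doesn't have to copy large payloads either: since Python 3.8, pickle protocol 5 supports out-of-band buffers, so the stream carries only metadata while the raw bytes can be transferred separately. A stdlib-only sketch (the 1 MiB payload is made up; I'm not claiming this is exactly what dask does internally):

```python
import pickle

# Protocol 5 (Python 3.8+) can pass large buffers out-of-band:
# the pickle stream then carries only tiny metadata, and the raw
# bytes can be transferred or shared without an extra copy.
big = bytearray(b"x" * (1 << 20))          # 1 MiB payload (made up)
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                      buffer_callback=buffers.append)
restored = pickle.loads(stream, buffers=buffers)

print(len(stream) < 100)                   # tiny: data went out-of-band
print(bytes(restored) == bytes(big))       # payload round-trips intact
```

So "pickling" per se doesn't rule out zero-copy; the question is whether the transport keeps the buffers out-of-band end to end.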
It's clear that when a calculation result is needed on another host or another worker, it has to be pickled and sent over. This can be avoided by relying on threaded parallelism within a single host (`--nthreads=8`), since all calculations access the same memory. dask uses this trick and simply passes references to calculation results instead of pickling them.
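A quick sketch of what I mean, using the single-machine threaded scheduler as a stand-in for a worker with `--nthreads=8` (the functions `make_data`/`passthrough` are made up for illustration):

```python
import dask

# With the threaded scheduler, all tasks run in one process and
# share its memory: results are passed by reference, not pickled.

@dask.delayed
def make_data():
    return [1, 2, 3]

@dask.delayed
def passthrough(x):
    return x

data = make_data()          # one task, consumed by two others
a, b = dask.compute(passthrough(data), passthrough(data),
                    scheduler="threads", num_workers=8)

# Both consumers received the very same in-memory object:
print(a is b)
```

If this prints `True`, no copy or pickle round-trip happened anywhere.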
When we're using process-based parallelization instead of threading on a single host (`--nprocs=8 --nthreads=1`), I'd expect that dask again pickles and unpickles the data to hand it to the other worker. Is this correct?
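Here is the same sketch with the multiprocessing scheduler as a stand-in for `--nprocs=8 --nthreads=1` (same made-up functions; forcing the `fork` start method is just so it runs as a plain Linux script without a `__main__` guard):

```python
import dask

# Force the fork start method so this sketch runs as a plain script
# on Linux (spawn would require an if __name__ == "__main__" guard).
dask.config.set({"multiprocessing.context": "fork"})

@dask.delayed
def make_data():
    return [1, 2, 3]

@dask.delayed
def passthrough(x):
    return x

data = make_data()
a, b = dask.compute(passthrough(data), passthrough(data),
                    scheduler="processes", num_workers=2)

# Results crossed process boundaries: equal in value, but each
# came back to the caller through its own pickle round-trip.
print(a == b, a is b)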
dask is great at identifying common sub-trees of calculations and can save me a lot of computation time through optimal execution. Once data has been sent over to another worker it should be quite fast. I'm comparing it to tools like ray, which for example passes results through plasma (and offers memmapping with zero-copy reads), but is probably not able to recognize common parts of larger calculation trees, so it will actually compute more.
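To illustrate the sub-tree sharing I have in mind (functions `expensive`/`add` are made up; the counter just records actual executions):

```python
import dask

# The intermediate `shared` below feeds two branches of the graph,
# but it is a single task key, so dask evaluates it only once.
calls = []

@dask.delayed
def expensive():
    calls.append(1)   # record every actual execution
    return 10

@dask.delayed
def add(x, y):
    return x + y

shared = expensive()                      # one node, two consumers
total = add(add(shared, 1), add(shared, 2))
result = total.compute(scheduler="threads")
print(result, len(calls))  # 23 1
```

The shared branch runs once even though two downstream tasks consume it, which is exactly the saving I'd expect to lose with a system that doesn't see the whole calculation tree.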