dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)
While this approach is efficient in terms of network usage, it still results in frequent pickling and unpickling of data. I assume this is not zero-copy pickling, unlike the memmapping that plasma or joblib can do for numpy arrays.
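For what it's worth, standard pickling doesn't have to copy large payloads either: since Python 3.8, pickle protocol 5 supports out-of-band buffers, so the stream carries only metadata while the raw bytes can be transferred separately. A stdlib-only sketch (the 1 MiB payload is made up; I'm not claiming this is exactly what dask does internally):

```python
import pickle

# Protocol 5 (Python 3.8+) can pass large buffers out-of-band:
# the pickle stream then carries only tiny metadata, and the raw
# bytes can be transferred or shared without an extra copy.
big = bytearray(b"x" * (1 << 20))          # 1 MiB payload (made up)
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                      buffer_callback=buffers.append)
restored = pickle.loads(stream, buffers=buffers)

print(len(stream) < 100)                   # tiny: data went out-of-band
print(bytes(restored) == bytes(big))       # payload round-trips intact
```

So "pickling" per se doesn't rule out zero-copy; the question is whether the transport keeps the buffers out-of-band end to end.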
It's clear that when a calculation result is needed on another host or another worker, it has to be pickled and sent over. This can be avoided by relying on threaded parallelism within a single host (`--nthreads=8`), since all calculations access the same memory. dask uses this trick and simply passes references to calculation results instead of pickling them.
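A quick sketch of what I mean, using the single-machine threaded scheduler as a stand-in for a worker with `--nthreads=8` (the functions `make_data`/`passthrough` are made up for illustration):

```python
import dask

# With the threaded scheduler, all tasks run in one process and
# share its memory: results are passed by reference, not pickled.

@dask.delayed
def make_data():
    return [1, 2, 3]

@dask.delayed
def passthrough(x):
    return x

data = make_data()          # one task, consumed by two others
a, b = dask.compute(passthrough(data), passthrough(data),
                    scheduler="threads", num_workers=8)

# Both consumers received the very same in-memory object:
print(a is b)
```

If this prints `True`, no copy or pickle round-trip happened anywhere.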
When we're using process-based parallelization instead of threading on a single host (`--nprocs=8 --nthreads=1`), I'd expect that dask again pickles and unpickles the data to hand it to the other worker. Is this correct?
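Here is the same sketch with the multiprocessing scheduler as a stand-in for `--nprocs=8 --nthreads=1` (same made-up functions; forcing the `fork` start method is just so it runs as a plain Linux script without a `__main__` guard):

```python
import dask

# Force the fork start method so this sketch runs as a plain script
# on Linux (spawn would require an if __name__ == "__main__" guard).
dask.config.set({"multiprocessing.context": "fork"})

@dask.delayed
def make_data():
    return [1, 2, 3]

@dask.delayed
def passthrough(x):
    return x

data = make_data()
a, b = dask.compute(passthrough(data), passthrough(data),
                    scheduler="processes", num_workers=2)

# Results crossed process boundaries: equal in value, but each
# came back to the caller through its own pickle round-trip.
print(a == b, a is b)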
dask is great at identifying common sub-trees of calculations and can save me a lot of computation time through optimal execution. Once data has been sent over to another worker it should be quite fast. I'm comparing it to tools like ray, which for example passes results through plasma (and offers memmapping with zero-copy reads), but is probably not able to recognize common parts of larger calculation trees, so it will actually compute more.
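To illustrate the sub-tree sharing I have in mind (functions `expensive`/`add` are made up; the counter just records actual executions):

```python
import dask

# The intermediate `shared` below feeds two branches of the graph,
# but it is a single task key, so dask evaluates it only once.
calls = []

@dask.delayed
def expensive():
    calls.append(1)   # record every actual execution
    return 10

@dask.delayed
def add(x, y):
    return x + y

shared = expensive()                      # one node, two consumers
total = add(add(shared, 1), add(shared, 2))
result = total.compute(scheduler="threads")
print(result, len(calls))  # 23 1
```

The shared branch runs once even though two downstream tasks consume it, which is exactly the saving I'd expect to lose with a system that doesn't see the whole calculation tree.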