
Is there a faster way to retrieve only a single element from a large published array in Dask, without retrieving the entire array?

In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

import distributed
client = distributed.Client()
data = [1] * 10000000
payload = {'array1': data}
client.publish_dataset(**payload)  # publish the list under the name 'array1'

one_element = client.get_dataset('array1')[0]  # pulls the entire list before indexing

1 Answer


Note that anything you publish goes to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections like dask.array.

Client 1

import dask.array as da
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 size chunks
x = x.persist()  # persist array on the workers of the cluster

client.publish_dataset(x=x)  # store the metadata of x on the scheduler

Client 2

x = client.get_dataset('x')  # get the lazy collection x
x[0].compute()  # this selection happens on the worker, only the result comes down
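
If the data really must stay a plain Python list rather than a dask array, a similar effect can be had by scattering it to the workers and indexing it remotely with client.submit, so only the selected element travels back. A minimal sketch, assuming a running cluster as above (the lambda getter is illustrative, not part of the original answer):

import distributed

client = distributed.Client()

data = [1] * 10000000
data_future = client.scatter(data)  # ship the list to a worker once

# index remotely; only the single element comes back to the client
one_element = client.submit(lambda lst, i: lst[i], data_future, 0).result()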

2 Comments

I would like to mark this question as complete, but when I run this code snippet using the suggested approach, it hangs at a[0].compute(). For the actual code snippet I use, please see the link below: stackoverflow.com/questions/45468673/…
I was able to get the code snippet you posted working. In order to get it working, I had to make sure that workers were also created for the scheduler in my version. For anyone else, see my link above for more info.
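
For anyone testing locally, one way to guarantee that workers exist is to start them explicitly with a LocalCluster; a minimal sketch (the worker counts are illustrative, not taken from the comments above):

import distributed

cluster = distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = distributed.Client(cluster)  # x[0].compute() needs at least one worker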

