Dask Event loop was unresponsive - work not parallelized

Question

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS.

I'm able to start up the scheduler on the first machine:

I then start up workers on several other machines. From those machines I can reach the scheduler using nc -zv ${HOST} ${PORT}, and the workers otherwise seem able to connect to the master, as evidenced by the worker's stdout: Registered to: tcp://10.201.101.108:31001. Almost immediately, however, the worker complains about a timeout loop.
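For reference, the nc -zv reachability check can also be sketched in Python, which is handy when nc is not installed on a node. This is only a minimal sketch: the helper name can_connect and its timeout default are assumptions for illustration, not part of Dask.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running can_connect("10.201.101.108", 31001) from a worker machine would then mirror the nc test against the scheduler address. Like nc, it only shows that the port accepts TCP connections, not that the Dask protocol on top of it is healthy.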

From the master node, in my Jupyter notebook I then connect to the scheduler:

dask_client = Client('10.201.101.108:31001')

But the work does not propagate to the worker nodes (worker-node CPU stays at <1%), or even to the worker running on the same machine as the scheduler. This is a highly parallelized task, and when run on a single machine (i.e., using Client(processes=False)) it consumes every core on the machine.

MRocklin · Accepted Answer · 2018-01-04 00:44:23Z


It is not uncommon to see the "Event loop was unresponsive" warning when first connecting, depending on your network.

Some things to check:

Run client.get_versions(check=True) to look for version mismatches between the client, scheduler, and workers.
Does client.scheduler_info()['workers'] contain anything? If not, the workers may be having trouble connecting.
Consider looking at the worker logs with client.get_worker_logs().
Try running a simple computation like client.submit(lambda x: x + 1, 10).result().
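The checks above can be bundled into one helper, roughly as sketched below. The function name run_checks and the shape of the returned dictionary are assumptions made for illustration. Only the four client calls come from the list above.

```python
def run_checks(client):
    """Run the four checks listed above against an already-connected Dask client.

    `client` is expected to be a distributed.Client; the report layout
    returned here is hypothetical and chosen only for readability.
    """
    report = {}

    # 1. Compare package versions across client, scheduler, and workers.
    #    A mismatch is reported by raising or warning, depending on the
    #    installed distributed version.
    client.get_versions(check=True)
    report["versions_checked"] = True

    # 2. Workers the scheduler currently knows about, keyed by address.
    #    An empty list suggests the workers never registered.
    report["workers"] = sorted(client.scheduler_info()["workers"])

    # 3. Worker logs, keyed by worker address.
    report["workers_with_logs"] = sorted(client.get_worker_logs())

    # 4. A trivial task: if this returns 11, tasks can reach a worker and
    #    results can travel back to the client.
    report["simple_result"] = client.submit(lambda x: x + 1, 10).result()

    return report
```

With the cluster from the question, this would be called as run_checks(Client('10.201.101.108:31001')). If the trivial task succeeds but the real workload still stays on one machine, the problem is more likely in how the work is submitted than in connectivity.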


3 Comments

user554481 Over a year ago

client.get_versions(check=True) shows all of the nodes and their software versions, but both client.scheduler_info()['workers'] and client.get_worker_logs() hang indefinitely. That makes it seem like it's a network connectivity issue, but if true how would the scheduler have been able to return results from client.get_versions(check=True) if it were not able to connect to the worker nodes?

user554481 Over a year ago

Actually, I take that back: client.scheduler_info()['workers'] and client.get_worker_logs() aren't hanging. They return results about all of the worker nodes quickly and without problems.

user554481 Over a year ago

I've now also tried on my personal Mac (work Mac might have been more locked down) and the issue persists.
