This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS.

I'm able to start up the scheduler on the first machine:

[Screenshot: scheduler startup output]

I then start up workers on several other machines. From those machines I can reach the scheduler using nc -zv ${HOST} ${PORT}, and the workers appear to connect to the scheduler, as shown by this line in the worker's stdout: Registered to: tcp://10.201.101.108:31001. Almost immediately afterwards, however, the worker starts complaining about a timeout in its event loop.
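For reference, the nc -zv reachability check can also be scripted. The following is only a minimal sketch using the Python standard library; the helper name can_connect is hypothetical, and the host and port would be whatever the scheduler reports (here, 10.201.101.108 and 31001).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Roughly equivalent to `nc -zv host port`: it only proves that the
    port accepts connections, not that Dask traffic flows correctly.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On a worker node this would be called as can_connect("10.201.101.108", 31001). It may also be worth running the same check in the opposite direction, from the scheduler machine to each worker's address and port, since a one-way test does not show that both sides can reach each other.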

[Screenshot: worker output showing the timeout messages]

From the master node, in my Jupyter notebook I then connect to the scheduler:

dask_client = Client('10.201.101.108:31001')

But the work does not propagate to the worker nodes (worker-node CPU stays at <1%), or even to the worker running on the same machine as the scheduler. This is a highly parallelized task: when run on a single machine (i.e., using Client(processes=False)), it consumes every core on the machine.

1 Answer

It is not uncommon to see the "Event loop was unresponsive" warning when first connecting, depending on your network.

Some things to check

  1. Run client.get_versions(check=True) to confirm that the client, scheduler, and workers have matching software versions.
  2. Does client.scheduler_info()['workers'] have anything? If not, the workers are probably having trouble connecting to the scheduler.
  3. Consider looking at the worker logs with client.get_worker_logs().
  4. Try running a simple computation like client.submit(lambda x: x + 1, 10).result().
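The four checks above can be collected into one helper. This is a rough sketch rather than an official Dask utility: the function name run_checks, the report dictionary, and the 30-second timeout are assumptions made for illustration. It expects an already-connected dask.distributed Client and uses only the calls listed above.

```python
def run_checks(client, timeout=30):
    """Run the four checks against an already-connected Dask client.

    `client` is assumed to be a dask.distributed.Client. Returns a small
    dictionary summarising what each check found.
    """
    report = {}

    # 1. Compare versions across client, scheduler, and workers. Depending
    #    on the distributed version, a mismatch raises or emits a warning.
    client.get_versions(check=True)
    report["versions_checked"] = True

    # 2. List the workers the scheduler knows about. An empty list suggests
    #    the workers never registered.
    workers = client.scheduler_info()["workers"]
    report["workers"] = sorted(workers)

    # 3. Fetch worker logs; record which workers responded.
    logs = client.get_worker_logs()
    report["workers_with_logs"] = sorted(logs)

    # 4. Submit a trivial task. The timeout turns a silent hang into an error.
    future = client.submit(lambda x: x + 1, 10)
    report["result"] = future.result(timeout=timeout)

    return report
```

With the cluster from the question, this would be called as run_checks(Client('10.201.101.108:31001')) after from dask.distributed import Client. If the last step times out while the first three succeed, that points to a problem that only shows up once tasks are sent to the workers.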

3 Comments

client.get_versions(check=True) shows all of the nodes and their software versions, but both client.scheduler_info()['workers'] and client.get_worker_logs() hang indefinitely. That makes it seem like it's a network connectivity issue, but if true how would the scheduler have been able to return results from client.get_versions(check=True) if it were not able to connect to the worker nodes?
Actually, I take that back: client.scheduler_info()['workers'] and client.get_worker_logs() aren't hanging. They're able to return results about all of the worker nodes quickly and without problems
I've now also tried on my personal Mac (work Mac might have been more locked down) and the issue persists.
