Local Dask scheduler failing to connect to workers on remote resource

Question

How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?

Situation

I have a remote resource I can ssh into. There, I have a Docker container that runs an image containing all the dependencies I need to run Dask and Distributed.

When run, the container executes the following:

dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:878

In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:

dask-scheduler --port 8786

When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:

Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786

On the scheduler, I see the following logged repeatedly, in a loop as it continually tries to contact/respond to each of the workers:

distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host

The issue (I think) can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource db.foo.net that I can ssh into via [email protected].

From the scheduler container, I can ping db.foo.net successfully. I think that the workers are assuming their address is the local address of the container they are in, and not db.foo.net. I need to override this default through some sort of configuration for the workers. I thought the --host flag would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.

Can you find the numerical IP of your worker, if not 172.18.0.10? Is it on an interface other than eth0?
– mdurant, Commented Jun 21, 2017 at 19:32

MRocklin · Accepted Answer · 2017-06-22 13:59:52Z

Dask workers need to be able to contact the scheduler at the address given to them. It sounds like this isn't happening for you. This could be for many reasons related to your network. A few possibilities:

1. You've mistyped the address (for example, I noticed that you used port 878 in one place in your question and port 8786 in another)
2. Your network doesn't allow communication on certain ports (check with your system administrator)
3. Your Docker containers aren't set up to publish ports externally (you may need to do some Docker wiring or use the host network explicitly)

Unfortunately, there isn't much that Dask itself can do to help you identify these network issues. You might try running other services on the relevant ports and seeing if you can recreate the lack of connectivity with common tools like ping or python -m http.server 8786
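
As a rough sketch of that kind of check (this is plain Python, not something Dask provides), the snippet below tries to open a TCP connection to a given host and port and reports whether it succeeded. The function name can_connect and the example address are made up for illustration; substitute the scheduler or worker address you actually see in your logs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, "No route to host", DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical target: replace with the address and port you want to test.
    host, port = "127.0.0.1", 8786
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```

Running this from the scheduler's container against a worker address, and from the worker's container against the scheduler address, should show which direction is blocked. A failure here points at the network or Docker setup rather than at Dask.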


3 Comments

kuanb Over a year ago

Thanks - how do the workers tell the scheduler where they are? Is that address acquired automatically? When workers and scheduler are on the same resource, everything works fine. In that situation I've set dask_scheduler as an alias in the Docker Compose links configuration. I assume that Docker is then able to proxy between containers, and that this is not happening when my scheduler is elsewhere. If that's the case, I would need to somehow tell the workers what to tell the scheduler about where they are located. Hope that train of thought makes sense.

MRocklin Over a year ago

You can specify the address with the --host or --interface keyword. Try dask-worker --help for more information.

kuanb Over a year ago

Just wanted to follow up here in case anyone else runs into this: the issue was #3. We used the "host networking" mode in Docker so that the containers run on the host computer's networking stack instead of the default "bridge" mode, which creates a Docker-specific network. Then you can run the workers with something like dask-worker --host $(curl -s http://instance-data/latest/meta-data/local-ipv4) ... on our EC2 instances (see more about getting instance metadata here: forums.aws.amazon.com/message.jspa?messageID=536813).
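
Outside EC2 there is no instance metadata endpoint to query, so here is a minimal sketch of one alternative way to find the address to pass to --host: ask the operating system which local address it would use to reach the scheduler. The helper names outbound_ip and worker_command are hypothetical, and this assumes the container uses host networking; in bridge mode it would just return the container-internal address (such as 172.18.0.10) that caused the problem in the first place.

```python
import socket


def outbound_ip(target_host: str, target_port: int = 8786) -> str:
    """Return the local IPv4 address the OS would use to reach the target.

    Connecting a UDP socket sends no packets; it only makes the OS pick
    a route, and therefore a source address, for that destination.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((target_host, target_port))
        return s.getsockname()[0]


def worker_command(worker_ip: str, scheduler_address: str) -> list:
    """Build a dask-worker argument list that advertises `worker_ip` via --host."""
    return ["dask-worker", scheduler_address, "--host", worker_ip]


if __name__ == "__main__":
    # Hypothetical scheduler location: replace with your own host and port.
    scheduler_host, scheduler_port = "127.0.0.1", 8786
    ip = outbound_ip(scheduler_host, scheduler_port)
    print(" ".join(worker_command(ip, f"{scheduler_host}:{scheduler_port}")))
```

The printed command can be compared against the address the scheduler logs for each worker; if the two differ, the workers are still advertising an address the scheduler cannot route to.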
