Question
How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?
Situation
I have a remote resource I can ssh into. There, I have a Docker container that runs an image containing all the dependencies I need to run Dask and Distributed.
When run, the container executes the following:
dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:8786
In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:
dask-scheduler --port 8786
When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:
Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786
On the scheduler, I see the following logged repeatedly, in a loop as it continually tries to contact/respond to each of the workers:
distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host
I think the issue can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource, db.foo.net, that I can ssh into via [email protected].
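To illustrate why that address looks wrong, here is a minimal sketch that pulls the worker host out of the scheduler's error line and checks whether it is a private address. The helper names `advertised_hosts` and `is_private` are my own, not part of Distributed; the log line is copied from the output above.

```python
import ipaddress
import re

# Error line copied from the scheduler log above.
LOG_LINE = (
    "distributed.scheduler - ERROR - Failed to connect to worker "
    "'tcp://172.18.0.10:44590': Timed out trying to connect to "
    "'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host"
)


def advertised_hosts(line):
    """Return the distinct IPv4 hosts found in tcp://host:port URLs in a log line."""
    return sorted(set(re.findall(r"tcp://(\d+\.\d+\.\d+\.\d+):\d+", line)))


def is_private(host):
    """True if the address lies in a private range such as 172.16.0.0/12."""
    return ipaddress.ip_address(host).is_private


for host in advertised_hosts(LOG_LINE):
    print(host, "private" if is_private(host) else "public")
```

A private address like this is generally only reachable from inside the network that assigned it, which would be consistent with the "No route to host" error the scheduler reports from my laptop.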
From the scheduler container, I can ping db.foo.net successfully. I think the workers assume their address is the local address of the container they run in, not db.foo.net. I need to override this default through some worker configuration. I thought the --host flag would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.
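My guess at what Errno 99 means here, sketched with plain sockets rather than Dask: binding a listening socket to an address that no local interface owns fails with EADDRNOTAVAIL, which Linux reports as "Cannot assign requested address". The helper `try_bind` is hypothetical, and 192.0.2.1 is a documentation-only address standing in for the host's address as seen from inside the container. I am assuming that --host makes the worker bind to the given address in roughly this way.

```python
import errno
import socket


def try_bind(host, port=0):
    """Try to bind a TCP socket to host; return None on success or the errno on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))  # port 0 lets the OS pick a free port
        except OSError as exc:
            return exc.errno
    return None


# The wildcard address is always bindable.
print(try_bind("0.0.0.0"))

# An address that no local interface owns cannot be bound.
print(try_bind("192.0.2.1") == errno.EADDRNOTAVAIL)
```

If that reading is right, the container can listen only on its own interfaces, so what I am really after is a way to advertise db.foo.net to the scheduler while still binding locally.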