
I am analysing some data using Dask distributed on a SLURM cluster, driving it from a Jupyter notebook. I change my codebase frequently and rerun jobs. Recently, a lot of my jobs started to crash. I suspected that my code was not getting updated on the workers, so I ran some tests, and it looks like that's the case (I changed the names of functions, restarted clusters, and checked the relevant line numbers).

As I write this, I am running two cluster instances: one works fine, while the other fails with the kind of error shown in the logs below. I also noticed that some jobs run just fine within the same cluster, while others fail with this versioning issue. The jobs run fine when I test them on my local computer.
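One way to confirm the stale-code suspicion is to fingerprint the module on the driver and on each worker and compare. Below is a minimal stdlib sketch of such a fingerprint helper; `json` is used as a stand-in module here only so the snippet runs anywhere, and in the actual setup you would pass `"trajectory.trackandsave"` instead:

```python
import hashlib
import importlib
import inspect

def module_fingerprint(module_name):
    """Return (file path, short source hash) for an importable module.

    Comparing these values between the driver and the workers shows
    whether everyone is importing the same copy of the code. The module
    name 'json' below is a stand-in; in the question's setting it would
    be 'trajectory.trackandsave'.
    """
    mod = importlib.import_module(module_name)
    source = inspect.getsource(mod)
    digest = hashlib.sha256(source.encode()).hexdigest()[:12]
    return mod.__file__, digest

# On a live cluster this could be run on every worker, e.g. with
# client.run(module_fingerprint, "trajectory.trackandsave"),
# and the per-worker results compared against the driver's own value.
path, digest = module_fingerprint("json")
print(path, digest)
```

If one worker reports a path like `/tmp/dask-scratch-space/trajectory/trackandsave.py` (as in the traceback below) while the driver reports the shared-filesystem path, or the hashes differ, that worker is running a stale copy.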

I should also point out that I am using a Jupyter extension:

%load_ext autoreload
%autoreload 2
## This forces modules to reload every time they are called

Any help with this would be appreciated.

Some more information:

  • Dask version: 2023.5.0
  • Python version: 3.8.16
  • Operating System: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-214-generic x86_64)
  • Install method (conda, pip, source): conda

Worker log:

2025-11-09 19:47:55,102 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.22.13.234:42697'
2025-11-09 19:47:56,145 - distributed.worker - INFO -       Start worker at:  tcp://172.22.13.234:33121
2025-11-09 19:47:56,145 - distributed.worker - INFO -          Listening to:  tcp://172.22.13.234:33121
2025-11-09 19:47:56,145 - distributed.worker - INFO -           Worker name:         DaskSlurmCluster-1
2025-11-09 19:47:56,145 - distributed.worker - INFO -          dashboard at:        172.22.13.234:45521
2025-11-09 19:47:56,146 - distributed.worker - INFO - Waiting to connect to:  tcp://172.22.13.232:40769
2025-11-09 19:47:56,146 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,146 - distributed.worker - INFO -               Threads:                          1
2025-11-09 19:47:56,146 - distributed.worker - INFO -                Memory:                   0.95 GiB
2025-11-09 19:47:56,146 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-7kpf4xi5
2025-11-09 19:47:56,146 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,752 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-11-09 19:47:56,753 - distributed.worker - INFO -         Registered to:  tcp://172.22.13.232:40769
2025-11-09 19:47:56,753 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,754 - distributed.core - INFO - Starting established connection to tcp://172.22.13.232:40769
2025-11-09 19:48:00,110 - distributed.worker - ERROR - Compute Failed
Key:       ps_v6-d908d0d4-535c-4183-a3d4-5544b31da3e0
State:     executing
Task:  <Task 'ps_v6-d908d0d4-535c-4183-a3d4-5544b31da3e0' ps_v6()>
Exception: 'ImportError("cannot import name \'single_cell_locate_framewise_v2\' from \'trajectory.trackandsave\' (/tmp/dask-scratch-space/trajectory/trackandsave.py)")'
Traceback: '  File "<string>", line 398, in ps_v6\n'

... (several other similar blocks)

2025-11-09 19:48:03,072 - distributed.worker - INFO - Stopping worker at tcp://172.22.13.234:33121. Reason: scheduler-remove-worker
2025-11-09 19:48:03,075 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://172.22.13.234:42697'. Reason: scheduler-remove-worker
2025-11-09 19:48:03,076 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-11-09 19:48:03,178 - distributed.nanny - INFO - Worker closed
2025-11-09 19:48:03,178 - distributed.core - INFO - Connection to tcp://172.22.13.232:40769 has been closed.
slurmstepd-slurm4: error: *** JOB 1614242 ON slurm4 CANCELLED AT 2025-11-09T19:48:03 ***
    If you are modifying code on a shared file system and starting workers on different nodes using that file system, this might cause a Python path problem. The easiest fix is to pip install your module into a proper Python environment; otherwise you need to make sure the Python path is set correctly on the remote workers. Also, it may work sometimes because all your workers happen to land on the same node; try specifying the dask-scratch-space on a shared file system. Commented Nov 14 at 16:17
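To check for the Python path problem the comment describes, a simple report of the import search path can be compared between the driver and each worker. This is a minimal sketch using only the stdlib; on a live cluster the same function could be dispatched to every worker with `client.run`:

```python
import sys

def python_path_report():
    """Return the interpreter's import search path.

    Comparing this output between the driver and each worker (e.g. via
    client.run(python_path_report)) shows whether a worker resolves the
    project package from a different location than the driver, such as a
    stale copy under /tmp/dask-scratch-space.
    """
    return list(sys.path)

paths = python_path_report()
print(len(paths), "entries on sys.path")
```

If a worker's `sys.path` puts its local scratch directory ahead of the shared-filesystem location of the project, that worker will keep importing the old copy regardless of `%autoreload`, which only reloads modules inside the notebook kernel, not on the workers.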
