I am analysing some data using dask distributed on a SLURM cluster, working from a Jupyter notebook. I change my codebase frequently and rerun jobs. Recently, a lot of my jobs started to crash. I suspected that my code was not getting updated, so I ran some tests, and it looks like that is the case (I changed the names of functions, restarted the clusters, and checked the relevant line numbers).
As I write this, I am running two cluster instances: one works fine, while the other fails with the kind of error shown in the worker log below. I also noticed that within the same cluster some jobs run just fine, while others fail with this versioning issue. The jobs run fine when I test them on my local computer.
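For context, the cluster and client are set up roughly like this (a minimal sketch: the actual resource requests, queue options, and function names differ, and my_analysis is just a placeholder):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

def my_analysis(x):
    # Placeholder for functions from my own codebase
    return x * 2

# Minimal sketch; real cores/memory/walltime and queue settings differ
cluster = SLURMCluster(cores=1, memory="1GiB", walltime="00:30:00")
cluster.scale(jobs=4)      # launch SLURM jobs, each starting a worker
client = Client(cluster)   # connect the notebook to the scheduler

futures = client.map(my_analysis, range(10))
results = client.gather(futures)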
I should also point out that I am using the IPython autoreload extension:
%load_ext autoreload
%autoreload 2
## This reloads changed modules automatically before executing code
Any help with this would be appreciated.
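For reference, the failing task has roughly this shape (simplified; the function names match the log below, the arguments are placeholders, and the real ps_v6 is much larger):

def ps_v6(path):
    # This import is what fails on the workers (see the log below);
    # the copy of trajectory/trackandsave.py in the worker scratch
    # space appears to be out of date.
    from trajectory.trackandsave import single_cell_locate_framewise_v2
    return single_cell_locate_framewise_v2(path)

future = client.submit(ps_v6, "path/to/input")
result = future.result()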
Some more information:
- Dask version: 2023.5.0
- Python version: 3.8.16
- Operating System: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-214-generic x86_64)
- Install method (conda, pip, source): conda
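Package versions can also be cross-checked between the client, scheduler, and workers with distributed's built-in check (sketch only, output not included here):

# Raises ValueError if package versions do not match across the cluster
versions = client.get_versions(check=True)
print(versions["client"]["packages"]["dask"])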
Worker log:
2025-11-09 19:47:55,102 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.22.13.234:42697'
2025-11-09 19:47:56,145 - distributed.worker - INFO - Start worker at: tcp://172.22.13.234:33121
2025-11-09 19:47:56,145 - distributed.worker - INFO - Listening to: tcp://172.22.13.234:33121
2025-11-09 19:47:56,145 - distributed.worker - INFO - Worker name: DaskSlurmCluster-1
2025-11-09 19:47:56,145 - distributed.worker - INFO - dashboard at: 172.22.13.234:45521
2025-11-09 19:47:56,146 - distributed.worker - INFO - Waiting to connect to: tcp://172.22.13.232:40769
2025-11-09 19:47:56,146 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,146 - distributed.worker - INFO - Threads: 1
2025-11-09 19:47:56,146 - distributed.worker - INFO - Memory: 0.95 GiB
2025-11-09 19:47:56,146 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-7kpf4xi5
2025-11-09 19:47:56,146 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,752 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-11-09 19:47:56,753 - distributed.worker - INFO - Registered to: tcp://172.22.13.232:40769
2025-11-09 19:47:56,753 - distributed.worker - INFO - -------------------------------------------------
2025-11-09 19:47:56,754 - distributed.core - INFO - Starting established connection to tcp://172.22.13.232:40769
2025-11-09 19:48:00,110 - distributed.worker - ERROR - Compute Failed
Key: ps_v6-d908d0d4-535c-4183-a3d4-5544b31da3e0
State: executing
Task: <Task 'ps_v6-d908d0d4-535c-4183-a3d4-5544b31da3e0' ps_v6()>
Exception: 'ImportError("cannot import name \'single_cell_locate_framewise_v2\' from \'trajectory.trackandsave\' (/tmp/dask-scratch-space/trajectory/trackandsave.py)")'
Traceback: ' File "<string>", line 398, in ps_v6\n'
... (several other similar blocks)
2025-11-09 19:48:03,072 - distributed.worker - INFO - Stopping worker at tcp://172.22.13.234:33121. Reason: scheduler-remove-worker
2025-11-09 19:48:03,075 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://172.22.13.234:42697'. Reason: scheduler-remove-worker
2025-11-09 19:48:03,076 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-11-09 19:48:03,178 - distributed.nanny - INFO - Worker closed
2025-11-09 19:48:03,178 - distributed.core - INFO - Connection to tcp://172.22.13.232:40769 has been closed.
slurmstepd-slurm4: error: *** JOB 1614242 ON slurm4 CANCELLED AT 2025-11-09T19:48:03 ***