I’m having a problem when my Python script runs a subprocess (with Popen) that executes a bash command (Slurm sbatch) to start a job on a different compute node.

The error happens during wandb.init(): wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.

The sbatch command starts a job on a different node and looks like this:

    p = Popen(
        [
            shutil.which("sbatch"),
            '--mem=40G',
            '--gres=gpu:titan_xp:1',
            '--nodelist=tikgpu02',
            '--cpus-per-task=2',
            '--output=/home/pschlaepfer/denselp/slt/log/%j.out',
            '--error=/home/pschlaepfer/denselp/slt/log/%j.err',
            '/home/pschlaepfer/denselp/slt/scripts/slt.sh',
            '--action=fine-tune-thf',
            '--max-length', '128',
            '--lr=4e-5',
            '--epochs=5',
            '--batch-size=16',
            '--task', task,
            '--pre-trained-path', checkpoint_path,
            '--wandb-mode=offline',
        ],
        start_new_session=True,
    )

wandb.init() is called like this:

experiment_name = f"job-id:{meta_config.job_id}"
run = wandb.init(
  project=wandb_project_choice+("-proto" if meta_config.is_debug_instance else ""),
  name=experiment_name,
  tags=[
    "job_id:"+str(meta_config.job_id)
  ],
  settings=wandb.Settings(start_method='fork'),
  dir=wandb_logging_dir_path,
  config=dict(experiment_config._asdict()) if type(experiment_config).__name__ == 'ExperimentConfig' else dict(experiment_config._as_dict()),
  reinit=True,
  mode="offline",
)

And here's the whole stacktrace:

    Traceback (most recent call last):
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 115, in _service_connect
        svc_iface._svc_connect(port=port)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
        self._sock_client.connect(port=port)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
        s.connect(("localhost", port))
    ConnectionRefusedError: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/pschlaepfer/denselp/slt/main.py", line 107, in <module>
        run = wandb.init(
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1185, in init
        raise e
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1162, in init
        wi.setup(kwargs)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 189, in setup
        self._wl = wandb_setup.setup(settings=setup_settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
        ret = _setup(settings=settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
        wl = _WandbSetup(settings=settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
        _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
        self._setup()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
        self._setup_manager()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
        self._manager = wandb_manager._Manager(settings=self._settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 152, in __init__
        wandb._sentry.reraise(e)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
        raise exc.with_traceback(sys.exc_info()[2])
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 150, in __init__
        self._service_connect()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 124, in _service_connect
        raise ManagerConnectionRefusedError(message)
    wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.

The wandb version used is 0.16.0.

Thank you very much for your help!

  • A quick question: you have wandb.init() in Job 1 and you are trying to log information from Job 2, submitted via Popen? You must call wandb.init() in each job. Wandb can't communicate across multiple jobs unless a sweep agent is distributing the parameters. What is your use case? Commented Nov 27, 2023 at 10:50
  • Sorry for the confusion. The second job is independent of the first one. The first calls wandb.init(). Then it starts a subprocess, which itself calls wandb.init() at the beginning. The idea is that it gets logged as a separate wandb run. It just fine-tunes on the model checkpoint produced by the first job. Commented Nov 27, 2023 at 12:14
  • OK. Can you create a minimal example? It doesn't need to be machine learning. It helps with tinkering. Maybe Popen is causing the issue. Commented Nov 27, 2023 at 12:21
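
One thing a minimal example could probe is what the child process inherits from the parent's environment. The sketch below rests on an assumption that the thread does not confirm: that the parent's wandb.init() advertises its local service through the WANDB_SERVICE environment variable, and that the job on the other node then tries to reach that service on its own localhost. The helper names are hypothetical and the child here is a throwaway Python process, not sbatch.

```python
import os
import subprocess
import sys

# Assumption: wandb exposes its local service to child processes through this
# environment variable. It is not mentioned anywhere in the question.
INHERITED_WANDB_VARS = ("WANDB_SERVICE",)


def env_without_wandb_service(env=None):
    """Return a copy of env (default: os.environ) without the wandb service variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in INHERITED_WANDB_VARS}


def child_sees_service(env):
    """Start a throwaway Python child and report whether it sees WANDB_SERVICE."""
    code = "import os; print('WANDB_SERVICE' in os.environ)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"
```

If the child does see the variable, passing env=env_without_wandb_service() to the Popen call that launches sbatch would be one thing to try. Whether that removes the ManagerConnectionRefusedError is untested here.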
