I have a bash script, submit.sh, for submitting training jobs to a Slurm server. It works as follows: running

bash submit.sh p1 8 config_file

will submit some task corresponding to config_file to 8 GPUs of partition p1. Each node of p1 has 4 GPUs, thus this command requests 2 nodes.
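For concreteness, the node count follows from ceiling division of the requested GPU total by the GPUs per node (a small sketch; the variable names mirror the script below and are not part of the original submission flow):

```shell
#!/bin/bash
# Ceiling division: 8 GPUs on nodes with 4 GPUs each -> 2 nodes.
NGPUs=8
NGPUS_PER_NODE=4
NNODES=$(( (NGPUs + NGPUS_PER_NODE - 1) / NGPUS_PER_NODE ))
echo "nodes requested: ${NNODES}"   # prints "nodes requested: 2"
```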

The content of submit.sh can be summarized as follows, in which I use sbatch to submit a Slurm script (train.slurm):

#!/bin/bash
# submit.sh

PARTITION=$1
NGPUs=$2
CONFIG=$3

NGPUS_PER_NODE=4
NCPUS_PER_TASK=10

sbatch --partition ${PARTITION} \
    --job-name=${CONFIG} \
    --output=logs/${CONFIG}_%j.log \
    --ntasks=${NGPUs} \
    --ntasks-per-node=${NGPUS_PER_NODE} \
    --cpus-per-task=${NCPUS_PER_TASK} \
    --gres=gpu:${NGPUS_PER_NODE} \
    --hint=nomultithread \
    --time=10:00:00 \
    --export=CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
    train.slurm

Now in the Slurm script, train.slurm, I decide whether to launch the training Python script on one node or on multiple nodes (the launch command differs between the two cases):

#!/bin/bash
# train.slurm
#SBATCH --distribution=block:block

# Load Python environment
module purge
module load pytorch/py3/1.6.0
 
set -x

if [ ${NGPUs} -gt ${NGPUS_PER_NODE} ]; then # Multi-node training
    # Some variables needed for the training script
    export MASTER_PORT=12340
    export WORLD_SIZE=${NGPUs}
    # etc.

    srun python train.py --cfg ${CONFIG}
else # Single-node training
    python -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} --use_env train.py --cfg ${CONFIG}
fi

Now if I submit to a single node (e.g., bash submit.sh p1 4 config_file), it works as expected. However, submitting to multiple nodes (e.g., bash submit.sh p1 8 config_file) produces the following error:

slurmstepd: error: execve(): python: No such file or directory

This means that the Python environment was not found on one of the nodes. I tried replacing python with $(which python) to use the full path to the Python binary in the virtual environment, but then I got another error:

OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

If I don't use submit.sh but instead add all the #SBATCH options to train.slurm and submit the job with sbatch directly from the command line, then it works. It therefore seems that wrapping sbatch inside a bash script causes the issue.

Could you please help me to resolve this?

Thank you so much in advance.

1 Answer
Beware that the --export parameter causes the environment seen by srun to be reset to exactly the SLURM_* variables plus the ones explicitly listed, in your case CONFIG, NGPUs, and NGPUS_PER_NODE. Consequently, the PATH variable is not set and srun cannot find the python executable. The same reset also drops LD_LIBRARY_PATH, which is why hard-coding the full path to the binary only moved the failure to a missing shared library (libmpi_cxx.so.40).

Note that --export does not alter the environment of the submission script itself, which is why the single-node case, which does not use srun, runs fine.

Try submitting with

--export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \

Note the added ALL as first item in the list.
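If you want to confirm what survives the reset, a quick check (a hypothetical one-off command, not part of your scripts, assuming a Slurm setup like yours) is to submit a trivial job and print the step environment:

```shell
# Submit a throwaway job and inspect the environment inside a job step.
sbatch --partition p1 --ntasks=1 \
    --export=CONFIG=test \
    --wrap 'srun env | sort'
# With --export as above, PATH should be absent from the output;
# with --export=ALL,CONFIG=test it should be present.
```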

Another option is to simply remove the --export line entirely and export the variables explicitly in the submit.sh script as the submission environment is propagated by default by Slurm to the job.

export PARTITION=$1
export NGPUs=$2
export CONFIG=$3

export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10
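With the variables exported this way, the sbatch call itself no longer needs --export at all (a sketch of the resulting submit.sh, mirroring the options from the question):

```shell
#!/bin/bash
# submit.sh (variant without --export; by default Slurm propagates
# the submission environment, including the exports above, to the job)
sbatch --partition ${PARTITION} \
    --job-name=${CONFIG} \
    --output=logs/${CONFIG}_%j.log \
    --ntasks=${NGPUs} \
    --ntasks-per-node=${NGPUS_PER_NODE} \
    --cpus-per-task=${NCPUS_PER_TASK} \
    --gres=gpu:${NGPUS_PER_NODE} \
    --hint=nomultithread \
    --time=10:00:00 \
    train.slurm
```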