
Problem: Inspired by this thread, I'm trying to write a wrapper script that submits SLURM array jobs using bash variables. However, SLURM environment variables such as $SLURM_ARRAY_TASK_ID come through as empty.

I suspect it has to do with how test_wrapper.sh expands the not-yet-defined SLURM variable, but I can't find a solution.

Below is a minimal example with a simple Python script that takes the array ID as an input argument. When it is called through the bash wrapper script, the Python script crashes because it receives an empty value.

test_wrapper.sh:

#!/bin/bash
for argument in "$@"
do
  key=$(echo "$argument" | cut -f 1 -d'=')
  value=$(echo "$argument" | cut -f 2 -d'=')
  case "$key" in
    "job_name")     job_name="$value" ;;
    "cpus")         cpus="$value" ;;
    "memory")       memory="$value" ;;
    "time")         time="$value" ;;
    "array")        array="$value" ;;
    *)
  esac
done

sbatch <<EOT
#!/bin/bash
#SBATCH --account=foobar
#SBATCH --cpus-per-task=${cpus:-1}
#SBATCH --mem-per-cpu=${memory:-1}GB
#SBATCH --time=${time:-00:01:00}
#SBATCH --array=${array:-1-2}
#SBATCH --job-name=${job_name:-Default_Job_Name}

if [ -z "$SLURM_ARRAY_TASK_ID" ]
then
      echo "The array ID \$SLURM_ARRAY_TASK_ID is empty"
else
      echo "The array ID \$SLURM_ARRAY_TASK_ID is NOT empty"
fi

srun python foo.py -a $SLURM_ARRAY_TASK_ID

echo "Job finished with exit code $?"

EOT

where foo.py is:

import argparse

def main(args):
  print('array number is : {}'.format(args.array_number))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-a", "--array_number",
        help="the value passed from SLURM_ARRAY_TASK_ID"
        )
    args = parser.parse_args()
    main(args)

$ cat slurm-123456789_1.out yields:

The array ID 1 is empty
usage: foo.py [-h] [-a ARRAY_NUMBER]
foo.py: error: argument -a/--array_number: expected one argument
srun: error: nc10931: task 0: Exited with exit code 2
Job finished with exit code 0

I find it strange that the line "The array ID 1 is empty" still prints the value of $SLURM_ARRAY_TASK_ID correctly (??)

  • It might be helpful here to collect xtrace logs: extend test_wrapper.sh to run exec 2>trace.log; PS4=':$LINENO+'; set -x right after the shebang. Reviewing trace.log will at least let you see what's going on in the loop that iterates over arguments. Commented Sep 16, 2022 at 20:41
  • ...if you replaced sbatch <<EOT with { tee /dev/stderr | sbatch; } <<EOT, then your trace log would also contain the script as it's being passed to sbatch. Commented Sep 16, 2022 at 20:42
  • Why do you only escape $SLURM_ARRAY_TASK_ID in some places and not others? The unescaped ones are being substituted before the heredoc is given to sbatch. Is that variable perhaps being set by sbatch as part of executing the script? Commented Sep 16, 2022 at 20:54
  • There's a similar issue in your log where srun says it exited with exit code 2, but then you print Job finished with exit code 0 because $? is substituted before sbatch receives the script. Perhaps you want to prevent parameter expansions by quoting <<"EOT"? Commented Sep 16, 2022 at 20:57
  • The line from the first comment was meant to be added at the top of the test_wrapper.sh script (as that comment said). As comments don't preserve whitespace, extra content should be edited into the question. Commented Sep 16, 2022 at 21:11

2 Answers

So according to this page:

Job arrays will have two additional environment variable set. SLURM_ARRAY_JOB_ID will be set to the first job ID of the array. SLURM_ARRAY_TASK_ID will be set to the job array index value.

That suggests to me that sbatch is supposed to set these for you. In that case, you need to escape all instances of $SLURM_ARRAY_TASK_ID in the script you pass via the heredoc so that they don't get prematurely substituted before sbatch can set the relevant environment variable.

The two options for this are:

  1. If you don't want any expansions to occur at all, quote the heredoc delimiter.
sbatch <<"EOT"
<your script here>
EOT
  2. If you need some expansions to occur but want to disable others, then escape the ones that should not be expanded by putting a \ in front of them, as you have done in parts of your existing script.
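
Both options can be tried without a cluster. The following is a minimal sketch, not how SLURM is actually invoked: a plain bash child process stands in for sbatch, and env sets SLURM_ARRAY_TASK_ID by hand to mimic what the scheduler would do at run time. The value 7 and the variable names mixed and quoted are assumptions made up for illustration.

```shell
#!/bin/bash
# Assumption: `env ... bash` plays the role of sbatch + the scheduler.
# The wrapper's own shell must not have these set/exported.
unset SLURM_ARRAY_TASK_ID cpus
cpus=4   # wrapper-side value that SHOULD be expanded before submission

# Option 2 (unquoted delimiter): ${cpus} expands now, \$... survives until run time.
mixed=$(env SLURM_ARRAY_TASK_ID=7 bash <<EOT
echo "cpus=${cpus} early=[$SLURM_ARRAY_TASK_ID] late=[\$SLURM_ARRAY_TASK_ID]"
false
echo "exit code seen by the job: \$?"
EOT
)

# Option 1 (quoted delimiter): the wrapper expands nothing at all.
quoted=$(env SLURM_ARRAY_TASK_ID=7 bash <<"EOT"
echo "cpus=[${cpus}] late=[$SLURM_ARRAY_TASK_ID]"
EOT
)

printf '%s\n' "$mixed" "$quoted"
```

With the unquoted delimiter, the unescaped $SLURM_ARRAY_TASK_ID is already empty by the time the child sees the script, while the escaped one and the escaped $? are evaluated inside the job. With the quoted delimiter, wrapper variables such as cpus are no longer substituted either, so that form only fits scripts that need no wrapper-side values.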

Thanks to the feedback posted in the comments I was able to fix the issue. Posting a "fixed" version of the wrapper script below.

In short, the solution is to escape $SLURM_ARRAY_TASK_ID.

#!/bin/bash
for argument in "$@"
do
  key=$(echo "$argument" | cut -f 1 -d'=')
  value=$(echo "$argument" | cut -f 2 -d'=')
  case "$key" in
    "job_name")     job_name="$value" ;;
    "cpus")         cpus="$value" ;;
    "memory")       memory="$value" ;;
    "time")         time="$value" ;;
    "array")        array="$value" ;;
    *)
  esac
done

{ tee /dev/stderr | sbatch; } <<EOT
#!/bin/bash
#SBATCH --account=foobar
#SBATCH --cpus-per-task=${cpus:-1}
#SBATCH --mem-per-cpu=${memory:-1}GB
#SBATCH --time=${time:-00:01:00}
#SBATCH --array=${array:-1-2}
#SBATCH --job-name=${job_name:-Default_Job_Name}

if [ -z "\$SLURM_ARRAY_TASK_ID" ]
then
      echo "The array ID \$SLURM_ARRAY_TASK_ID is empty"
else
      echo "The array ID \$SLURM_ARRAY_TASK_ID is NOT empty"
fi

python foo.py -a \$SLURM_ARRAY_TASK_ID

EOT

$ cat slurm-123456789_1.out yields:

The array ID 1 is NOT empty
array number is : 1

Note: the { tee /dev/stderr | sbatch; } wrapper is not necessary, but it is very useful for debugging (thanks Charles Duffy).
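
The tee trick can also be tried on a machine without SLURM. In this rough sketch, submit_stub is a hypothetical stand-in for sbatch that simply discards the script it receives, and the extra 2>&1 exists only so the copy that tee writes to stderr can be captured in the variable seen; neither is part of the wrapper above.

```shell
#!/bin/bash
# Hypothetical stand-in for sbatch: swallows the script it is given.
submit_stub() { cat >/dev/null; }

array="1-3"   # wrapper-side value, expanded before "submission"

# tee copies the generated script to stderr and also forwards it to the stub.
# 2>&1 is here only so this sketch can capture the stderr copy in a variable.
seen=$( { tee /dev/stderr | submit_stub; } 2>&1 <<EOT
#!/bin/bash
#SBATCH --array=${array}
echo "task \$SLURM_ARRAY_TASK_ID"
EOT
)

# What the submit command actually received, after wrapper-side expansion:
printf '%s\n' "$seen"
```

Reading the captured text makes premature expansion easy to spot: anything that should be resolved inside the job must still appear with its leading $ in this output.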
