
I have been struggling to get multiple instances of a Python script to run on SLURM. On my login node I have installed python3.6, and I have a Python script "my_script.py" that reads its run parameters from a text file passed as an argument. I can run this script on the login node using

python3.6 my_script.py input1.txt

Furthermore, I can submit a script submit.sh to run the job:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input1.txt
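
which I submit with:

sbatch submit.sh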

This runs fine and executes as expected. However, if I submit the following script:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input2.txt

while the first is still running, I get the following error message in output2.txt:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found

I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample 
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt

I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found

I am running all of these scripts on an HPC cluster that I created using the Compute Engine API on Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:

https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0

Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through SLURM FAQs and other Stack Overflow questions, but I have not found a way to resolve this issue or a suitable explanation of what's causing it in the first place.

Thank you

  • Could it be that the first job runs on one machine, and the second on another? And that on this second node, Python3 is not installed? What is the structure of your cluster? Commented Oct 24, 2018 at 6:28
  • Welcome to Stack Overflow! As Damien suggested, could you please edit your post to include the cluster configuration YAML file. Commented Oct 24, 2018 at 17:33
  • I could not locate the YAML file that I used to create the cluster, but I had used the template from the tutorial. I have now resolved the issue. Commented Oct 25, 2018 at 1:17
  • @damienfrancois I am now having issues getting my task array to run multiple tasks per node when specifying --cpus-per-task=1 and --tasks-per-node=2 in my submission script. Could you point me towards an example submission script that fills a node with tasks from a task array based on the mem and cpu settings? I have not been able to find a good example to work from and I would really appreciate it. Commented Oct 25, 2018 at 2:15
  • Is Slurm configured to run multiple jobs on the same node? What is the value of SelectType in the configuration file? Commented Oct 25, 2018 at 8:14

1 Answer


I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point when I was trying to get things to work I had submitted a job to compute1 with the following commands:

# Install Python 3.6
sudo yum -y install python36

# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools

# Install pip using easy_install
sudo easy_install-3.6 pip

which I had taken from the following post:

How do I install python 3 on google cloud console?

This had installed python3.6 on compute1, which is why my jobs would run on compute1. However, I didn't think this script had run successfully, so I never submitted it to compute2, and therefore the jobs sent to compute2 could not find python3.6. For some reason I thought Slurm was using the python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
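
In other words, the interpreter needs to be present on every compute node that might run the job, not just one of them. A rough sketch of doing the install on both nodes at once through Slurm itself (assuming the nodes are named compute1 and compute2, as in my cluster, and that your account can sudo on the compute nodes) would be:

# Run the install step once on each compute node
srun --nodes=2 --nodelist=compute1,compute2 --ntasks-per-node=1 \
    sudo yum -y install python36 python36-setuptools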

After installing python3.6 on compute2, I was then able to import all of my locally installed Python libraries by including

import sys
import os

sys.path.append(os.getcwd()) 

at the beginning of my Python script, as described in the following post:

How to import a local python module when using the sbatch command in SLURM

