
I have been struggling to get multiple instances of a Python script to run on SLURM. On my login node I have installed python3.6, and I have a Python script "my_script.py" that reads its run parameters from a text file passed as an argument. I can run this script on the login node using

python3.6 my_script.py input1.txt

Furthermore, I can submit a script submit.sh to run the job:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input1.txt
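
which I submit with:

sbatch submit.sh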

This runs fine and executes as expected. However, if I submit the following script:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input2.txt

while the first is still running, I get the following error message in output2.txt:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found

I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample 
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt

I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found

I am running all of these scripts on an HPC cluster that I created using the Compute Engine API on Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:

https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0

Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through SLURM FAQs and other Stack Overflow questions, but I have not found a way to resolve this issue or a suitable explanation of what's causing it in the first place.

Thank you

  • Could it be that the first job runs on one machine, and the second on another? And that on this second node, Python3 is not installed? What is the structure of your cluster? Commented Oct 24, 2018 at 6:28
  • Welcome to Stack Overflow! As Damien suggested, could you please edit your post to include the cluster configuration YAML file. Commented Oct 24, 2018 at 17:33
  • I could not locate the YAML file that I used to create the cluster, but I had used the template from the tutorial. I have now resolved the issue. Commented Oct 25, 2018 at 1:17
  • @damienfrancois I am now having issues getting my task array to run multiple tasks per node when specifying --cpus-per-task=1 and --tasks-per-node=2 in my submission script. Could you point me towards an example submission script that fills a node with tasks from a task array based on the mem and cpu settings? I have not been able to find a good example to work from and I would really appreciate it. Commented Oct 25, 2018 at 2:15
  • Is Slurm configured to run multiple jobs on the same node? What is the value of SelectType in the configuration file? Commented Oct 25, 2018 at 8:14

1 Answer


I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point when I was trying to get things to work I had submitted a job to compute1 with the following commands:

# Install Python 3.6
sudo yum -y install python36

# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools

# Install pip using easy_install
sudo easy_install-3.6 pip

which I had taken from the following post:

How do I install python 3 on google cloud console?

This had installed python3.6 on compute1, which is why my jobs would run on compute1. However, I didn't think this script had run successfully, so I never submitted it to compute2, and therefore the jobs sent to compute2 could not find python3.6. For some reason I thought Slurm was using the python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
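
In other words, the interpreter needs to be present on every compute node that might run the job, not just one of them. A rough sketch of doing the install on both nodes at once through Slurm itself (assuming the nodes are named compute1 and compute2, as in my cluster, and that your account can sudo on the compute nodes) would be:

# Run the install step once on each compute node
srun --nodes=2 --nodelist=compute1,compute2 --ntasks-per-node=1 \
    sudo yum -y install python36 python36-setuptools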

After installing python3.6 on compute2, I was then able to import all of my locally installed Python libraries by including

import sys
import os

sys.path.append(os.getcwd()) 

at the beginning of my Python script, as described in the following post:

How to import a local python module when using the sbatch command in SLURM

