I am trying to run a PyTorch application that uses DDP with MPI as the communication backend. The cluster I am running it on has two network interfaces on every node: a fast Ethernet interface and an InfiniBand interface.

When launching with srun, how can I specify the network interface to be used? I saw that when using plain MPI I can add "iface ib0" to my mpirun command, but how do I achieve the same thing when working with Slurm?

I have attached below the sample script that I want to use. Can someone confirm whether this is the right approach?

#!/bin/bash
#SBATCH --job-name=resnet50_cifar100_job
#SBATCH --output=resnet50_cifar100_output_opx_%j.txt
#SBATCH --error=resnet50_cifar100_error_opx_%j.txt
#SBATCH --ntasks=16                
#SBATCH --nodes=4             
#SBATCH --ntasks-per-node=4        

# Source the environment setup script
source $HOME/activate_environment.sh

# Activate the Python virtual environment
source $HOME/torch_mpi_env/bin/activate


#export FI_TCP_IFACE=ib0
#export FI_PROVIDER=psm2
#export I_MPI_FABRICS=ofi
#export I_MPI_FALLBACK=0
 
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=psm2

export MPIP="-f ./mpip_results"

export SLURM_NETWORK=ib0

# Run the Python script
srun --mpi=pmi2 --network=ib0 \
     --export=ALL,LD_PRELOAD=$HOME/mpiP_build/lib/libmpiP.so \
     python $HOME/torch_projects/resnet50_cifar100.py --epochs 200


# Deactivate the virtual environment
deactivate
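
As a sanity check before the real run, something like the following could confirm that ib0 is actually present on a node. This is only a minimal sketch: `iface_exists` is a hypothetical helper name (not part of Slurm or MPI), and it assumes Linux nodes where each interface appears under /sys/class/net. It only checks that the interface exists, not that MPI traffic is routed over it.

```shell
#!/bin/bash
# Sketch only: iface_exists is a hypothetical helper, not a Slurm or MPI command.
# Assumes Linux, where every network interface has an entry under /sys/class/net.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# Report whether ib0 is present on the node this runs on.
if iface_exists ib0; then
    echo "$(uname -n): ib0 present"
else
    echo "$(uname -n): ib0 missing"
fi
```

Saved as its own script, this could be launched once per node inside the same allocation to compare the output across all four nodes.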
