Distributed training on a SLURM cluster

Sorry for the naive question, but I am confused about how distributed training integrates with a SLURM cluster. Do we need to explicitly call torch.distributed.launch when invoking the Python script, or is this taken care of automatically?

In other words, is this script correct?

#!/bin/bash
#SBATCH -p <dummy_name>
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:Tesla-V100-32GB:4
#SBATCH --cpus-per-task=2
#SBATCH --mem=60G
#SBATCH --job-name=knee_eval_ad_ax0
#SBATCH --output=slurm.out

eval "$(conda shell.bash hook)"
conda activate pytorch

python -m torch.distributed.launch --nproc_per_node=4 main.py 

Your help would be highly appreciated.

Thanks,
Chinmay

1 Like

The script looks fine, but you might want to replace the launch command with torchrun, as the former is (or will be) deprecated. Are you able to check the GPU utilization on this node, and could you verify that all devices are used?
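
For example, the last line of your script could become something like this (just a sketch, assuming the same single node with 4 GPUs):

torchrun --standalone --nproc_per_node=4 main.py

Note that torchrun exposes the local rank via the LOCAL_RANK environment variable rather than the --local_rank argument, so main.py might need a small adjustment.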

1 Like

Thank you so much for your reply. I do not have permission to log in to individual nodes, so I cannot see the utilization. However, I can see that the per-node batch size is being divided by the number of GPUs (4).

That’s a good point. Additionally, you could add a few print statements to make sure the data is pushed to all devices and that the corresponding parameters are on the same device.
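
Something like this inside the training loop would do (a minimal sketch; model, data, and target are placeholders for your own objects):

import torch.distributed as dist

# print once per process to verify the device placement
print(f"rank {dist.get_rank()}: data on {data.device}, "
      f"target on {target.device}, "
      f"first param on {next(model.parameters()).device}")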

1 Like

Just to make sure I understood correctly: it is not sufficient to have the slurm parameters or torchrun alone; we need to provide both of them for things to work.

I’m not a slurm expert and think it could be possible to let slurm handle the distributed run somehow.
However, I’m using slurm to set up the node and letting PyTorch handle the actual DDP launch (which also seems to be your use case). Let’s wait and see whether some slurm experts can give you more ideas.

2 Likes

Hi @ptrblck, can you please share more details on how you use slurm to set up the node?

There is an excellent tutorial on distributed training with PyTorch under SLURM from Princeton here. Below is my submission job script, which runs the training inside a Singularity container.

#!/bin/bash

#SBATCH --job-name=COOL_JOB_NAME    # create a short name for your job
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=4      # total number of tasks per node
#SBATCH --gres=gpu:4            # number of gpus per node
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=240GB               # total memory per node 
#SBATCH --time=23:59:00          # total run time limit (HH:MM:SS)

##### Number of total processes 
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "
echo "Nodelist:= " $SLURM_JOB_NODELIST
echo "Number of nodes:= " $SLURM_JOB_NUM_NODES
echo "Ntasks per node:= "  $SLURM_NTASKS_PER_NODE
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "

# If you want to load things from your .bashrc profile, e.g. cuda drivers, singularity etc 
source ~/.bashrc


# ******** Master port, address, and world size MUST be exported as environment variables for DDP to work
# ******** (they are read internally, e.g. by torch.distributed's env:// initialization)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "MASTER_PORT"=$MASTER_PORT
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
# ******************************************************************************************

# zoom zoom - recommended from lightning
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export NCCL_MIN_CHANNELS=32


echo "Run started at:- "
date

# Actual run of script 
#srun python main.py # Use this if you have python in your environment
srun singularity exec --nv /Location/Of/Your/Containers/pytorch_22.05-py3.sif python main.py
echo ""
echo "################################################################"
echo "@@@@@@@@@@@@@@@@@@ Run completed at:- @@@@@@@@@@@@@@@@@@@@@@@@@"
date
echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
echo "################################################################"
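
In case it helps, this is a minimal sketch of the DDP setup inside main.py that consumes these variables when launched via srun (placeholder model, real training code omitted):

import os
import torch
import torch.distributed as dist

def setup_ddp():
    # srun starts one process per task; SLURM provides the ranks,
    # and the sbatch script above exports MASTER_ADDR, MASTER_PORT and WORLD_SIZE
    rank = int(os.environ["SLURM_PROCID"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT
        world_size=world_size,
        rank=rank,
    )
    return rank, local_rank

if __name__ == "__main__":
    rank, local_rank = setup_ddp()
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()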

2 Likes

Hi folks, I’ve been fighting with this for the past couple of days and I’d like to get some advice.

So the initial question is about running torchrun while SLURM is scheduling the job (i.e., from within an sbatch script). Here is an old example of this working: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/sbatch_run.sh#L17-L23. Please note this requires setting the rendezvous endpoint etc.; I think deriving these from SLURM env vars is key so that torchrun knows what resources it has. I was able to run this on a single node, but with two nodes my job would just hang at the “rendezvous’ing worker group” step and never even initialize. The code example in the link didn’t work for me, but it is rather old.
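
For reference, the relevant part of that script follows roughly this pattern (a sketch with placeholder values, not the exact file):

# inside the sbatch script, after the #SBATCH directives
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$head_node_ip:29500 \
    main.py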

For more context, I am able to run multi-node PyTorch SLURM-scheduled jobs without torchrun (as the previous excellent comment suggested), but this isn’t ideal as it requires more code modification. However, I do need to go through a container layer, so those commands are exactly what I needed. The example in the link above doesn’t go through a container layer like Singularity, so perhaps that is why it isn’t working with torchrun in my setup. Thoughts?

Any help on setting up the rendezvous args or a better explanation than the docs would be greatly appreciated! I’ll provide code snippets/outputs when I hear from someone and I’m by my computer :)

1 Like