Sorry for the naive question, but I am confused about how distributed training integrates with a Slurm cluster. Do we need to explicitly call torch.distributed.launch when invoking the Python script, or is this taken care of automatically?
The script looks fine, but you might want to replace the launch command with torchrun, as the former is (or will be) deprecated. Are you able to check the GPU utilization on this node, and could you check whether all devices are used?
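For reference, the swap usually amounts to replacing the old launcher invocation with torchrun; a minimal sketch (the flag value and script name are illustrative, not from the thread):

```shell
# Old, deprecated launcher:
#   python -m torch.distributed.launch --nproc_per_node=4 main.py
# torchrun equivalent; note that torchrun exposes the local rank via the
# LOCAL_RANK environment variable instead of passing --local_rank as a
# script argument, so the training script may need a small change.
torchrun --nproc_per_node=4 main.py
```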
Thank you so much for your reply. I do not have permission to log in to individual nodes, so I cannot see the utilization. However, I can see that the per-node batch size is being divided by the number of GPUs (4).
That’s a good point. Additionally, you could add a few print statements to make sure the data is pushed to all devices and that the corresponding parameters are on the same device.
Just to make sure I understood correctly: it is not sufficient to have the Slurm parameters or torchrun separately; we need to provide both of them for things to work.
I’m not a slurm expert, but I think it could be possible to let slurm handle the distributed run somehow.
However, I’m using slurm to set up the nodes and letting PyTorch handle the actual DDP launch (which seems to also be your use case). Let’s wait and see if some slurm experts can give you more ideas.
There is an excellent tutorial on distributed training with PyTorch under SLURM, from Princeton, here. This is my submission job script, using containers via Singularity:
#!/bin/bash
#SBATCH --job-name=COOL_JOB_NAME # create a short name for your job
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=4 # total number of tasks per node
#SBATCH --gres=gpu:4 # number of gpus per node
#SBATCH --cpus-per-task=4 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=240GB # total memory per node
#SBATCH --time=23:59:00 # total run time limit (HH:MM:SS)
##### Number of total processes
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "
echo "Nodelist:= " $SLURM_JOB_NODELIST
echo "Number of nodes:= " $SLURM_JOB_NUM_NODES
echo "Ntasks per node:= " $SLURM_NTASKS_PER_NODE
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "
# If you want to load things from your .bashrc profile, e.g. cuda drivers, singularity etc
source ~/.bashrc
# ******************* These are read internally it seems ***********************************
# ******** Master port, address and world size MUST be passed as variables for DDP to work
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "MASTER_PORT="$MASTER_PORT
echo "WORLD_SIZE="$WORLD_SIZE
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
# ******************************************************************************************
# zoom zoom - recommended from lightning
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export NCCL_MIN_CHANNELS=32
echo "Run started at:- "
date
# Actual run of script
#srun python main.py # Use this if you have python in your environment
srun singularity exec --nv /Location/Of/Your/Containers/pytorch_22.05-py3.sif python main.py
echo ""
echo "################################################################"
echo "@@@@@@@@@@@@@@@@@@ Run completed at:- @@@@@@@@@@@@@@@@@@@@@@@@@"
date
echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
echo "################################################################"
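A note on the MASTER_PORT line in the script above: it takes the last four digits of the job ID and offsets them by 10000, so concurrent jobs land on distinct ports. A quick sketch with a fixed job ID for illustration:

```shell
# Same expression as in the job script, with an illustrative job ID
SLURM_JOBID=123456789
expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)   # prints 16789
```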
Hi folks, I’ve been fighting with this for the past couple of days and I’d like to get some advice.
So the initial question is about running torchrun while SLURM schedules the job (i.e., within an sbatch script). Here is an old example of this working: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/sbatch_run.sh#L17-L23. Note that this requires modifying the rendezvous endpoint, etc. I think setting these from SLURM environment variables is key, so torchrun knows what resources it has. I was able to run this on a single node; however, with two nodes my job would just hang at the “rendezvous’ing worker group” stage, so it wouldn’t even initialize. The code example in the link didn’t work for me, but it is rather old.
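For reference, a sketch of what such a torchrun invocation looks like when fed from SLURM variables (the port and nproc values are illustrative; the flag names come from the torchrun documentation):

```shell
# One torchrun per node (so sbatch would use --ntasks-per-node=1 here);
# torchrun itself spawns the per-GPU worker processes on each node.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$head_node:29500 \
    main.py
```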
For more context, I am able to run multi-node PyTorch SLURM-scheduled jobs without torchrun (as the previous excellent comment suggested), but this isn’t ideal, as it requires more code modification. However, I do need to go through a container layer, so those commands are exactly what I needed. The example in the link above doesn’t go through a container like Singularity, so perhaps that is why it doesn’t work with torchrun. Thoughts?
Any help on setting up the rendezvous args, or a better explanation than the docs, would be greatly appreciated! I’ll provide code snippets/outputs once I hear from someone and am at my computer.