Distributed training on a SLURM cluster

Sorry for the naive question, but I am confused about how distributed training integrates with a SLURM cluster. Do we need to explicitly call torch.distributed.launch when invoking the Python script, or is this taken care of automatically?

In other words, is this script correct?

#!/bin/bash
#SBATCH -p <dummy_name>
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:Tesla-V100-32GB:4
#SBATCH --cpus-per-task=2
#SBATCH --mem=60G
#SBATCH --job-name=knee_eval_ad_ax0
#SBATCH --output=slurm.out

eval "$(conda shell.bash hook)"
conda activate pytorch

python -m torch.distributed.launch --nproc_per_node=4 main.py 

Your help would be highly appreciated.

Thanks,
Chinmay

The script looks fine, but you might want to replace the torch.distributed.launch command with torchrun, as the former is (or will be) deprecated. Are you able to check the GPU utilization on this node and verify that all devices are being used?
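For reference, a minimal sketch of what the single-node launch line could look like with torchrun (assuming the same main.py and 4 GPUs; note that torchrun passes the local rank via the LOCAL_RANK environment variable instead of a --local_rank argument, so main.py may need a small adjustment):

# Hypothetical single-node replacement for the torch.distributed.launch line
torchrun --standalone --nproc_per_node=4 main.py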


Thank you so much for your reply. I do not have permission to log in to individual nodes, so I cannot see the utilization. However, I can see that the per-node batch size is being divided by the number of GPUs (4).

That’s a good point. Additionally, you could add a few print statements to make sure the data is pushed to all devices and that the corresponding parameters are on the same device.
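A minimal sketch of such a check inside the training loop of main.py (the names model, data, and target are assumptions about your script, not its actual variable names):

# Hypothetical debug prints inside the training loop (variable names assumed)
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0
print(f"rank {rank}: data on {data.device}, target on {target.device}")
print(f"rank {rank}: params on {next(model.parameters()).device}")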


Just to make sure I understood correctly: it is not sufficient to use the SLURM parameters or torchrun alone; we need to provide both of them for things to work.

I’m not a slurm expert, and I think it might be possible to let slurm handle the distributed run somehow.
However, I’m using slurm to set up the nodes and letting PyTorch handle the actual DDP launch (which seems to also be your use case). Let’s wait and see if some slurm experts can give you more ideas.
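As a rough sketch of that division of labor (slurm allocates the nodes, torchrun spawns one process per GPU), a multi-node job script could look something like the following; the node/GPU counts and the rendezvous port are placeholders, not a tested setup:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4

# slurm provides the allocated node list; use the first node as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one task per node; torchrun then spawns one process per GPU on each node
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    main.py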


Hi @ptrblck, can you please share more details on how you use slurm to set up the node?