Sorry for the naive question, but I am confused about how distributed training integrates with a Slurm cluster. Do we need to explicitly call
torch.distributed.launch when invoking the Python script, or is this taken care of automatically?
In other words, is this script correct?
#SBATCH -p <dummy_name>
eval "$(conda shell.bash hook)"
conda activate pytorch
python -m torch.distributed.launch --nproc_per_node=4 main.py
Your help would be highly appreciated.
The script looks fine, but you might want to replace the launch command with torchrun, as the former is (or will be) deprecated. Are you able to check the GPU utilization on this node, and could you check if all devices are used?
Thank you so much for your reply. I do not have permission to log in to individual nodes, so I cannot see the utilization. However, I can see that the batch size on the node is being divided by the number of GPUs (4).
That’s a good point. Additionally, you could add a few print statements to make sure that the data is pushed to all devices and that, in each process, the corresponding parameters are on the same device as the data.
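As a rough illustration of the print statements mentioned above, here is a minimal sketch. The helper name placement_report is hypothetical, and it assumes the script is started with torchrun, which exports RANK, LOCAL_RANK and WORLD_SIZE to every worker process. Devices are passed in as plain strings so the helper itself does not depend on PyTorch.

```python
import os


def placement_report(batch_device, param_devices, env=None):
    """Return a one-line summary of where this process keeps its batch and parameters.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    env = os.environ if env is None else env
    # Assumption: torchrun exports these variables for every worker process.
    rank = env.get("RANK", "?")
    local_rank = env.get("LOCAL_RANK", "?")
    world_size = env.get("WORLD_SIZE", "?")
    # Collapse the per-parameter devices into a sorted list of unique names.
    devices = sorted(set(param_devices))
    # Within one process, the batch and all parameters should share one device.
    consistent = devices == [batch_device]
    return (
        f"rank {rank}/{world_size} (local {local_rank}): "
        f"batch on {batch_device}, params on {devices}, consistent={consistent}"
    )


# Inside the training loop of main.py this could be called as:
#   print(placement_report(str(batch.device),
#                          [str(p.device) for p in model.parameters()]),
#         flush=True)
```

If every rank from 0 to WORLD_SIZE - 1 shows up in the job's output file, each with its own cuda index and consistent=True, that is a reasonable sign that all GPUs are in use, even without login access to the node.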
Just to make sure I understood correctly: it is not sufficient to have the Slurm parameters or torchrun on their own. We need to provide both of them for things to work.
I’m not a Slurm expert, but I think it could be possible to let Slurm handle the distributed run somehow.
However, I’m using Slurm to set up the node and let PyTorch handle the actual DDP launch (which also seems to be your use case). Let’s see whether some Slurm experts can give you more ideas.
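To make the split of responsibilities concrete, here is a small sketch of one possible arrangement, not necessarily the setup described above. Slurm allocates the node and GPUs, and a torchrun command line is derived from that allocation. The helper name build_torchrun_command is hypothetical, and the sketch assumes a single-node job in which Slurm has set SLURM_GPUS_ON_NODE (which it does when GPUs were requested for the job); otherwise it falls back to a default.

```python
import os


def build_torchrun_command(script="main.py", env=None, default_gpus=1):
    """Build a single-node torchrun command from the Slurm allocation.

    Hypothetical helper for illustration; assumes a one-node job.
    """
    env = os.environ if env is None else env
    # Assumption: SLURM_GPUS_ON_NODE holds the number of GPUs Slurm granted on this node.
    gpus = int(env.get("SLURM_GPUS_ON_NODE", default_gpus))
    # One worker process per allocated GPU; --standalone is for single-node runs.
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={gpus}",
        script,
    ]


if __name__ == "__main__":
    # Print the command that the batch script would run.
    print(" ".join(build_torchrun_command()))
```

With four GPUs allocated, the printed command matches the one in the original script, with torchrun in place of python -m torch.distributed.launch.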
Hi @ptrblck, can you please share more details on how you use Slurm to set up the node?