What’s the best practice for running either a single-node multi-GPU or multi-node multi-GPU job? In particular I’m using Slurm to allocate the resources, and while it is possible to select the number of nodes and the number of GPUs per node, I prefer to request just the total number of GPUs and let Slurm handle the allocation.
The thing is, there are two possible cases:
1. Slurm allocates all of the GPUs on the same node.
2. Slurm allocates the GPUs across multiple nodes.
It is important to mention that the allocation request is for X tasks (processes) with 1 GPU per task, so in the end there will be X tasks and X GPUs available.
I’ve noticed that running “torchrun” with the `--nproc_per_node` argument set to a number larger than 1 creates new processes (tasks), which are redundant duplicates since the tasks were already allocated by Slurm; but without setting this argument to the correct number of tasks, the training won’t start at all. That said, it seems I should allocate only a single task per node for X GPUs and let “torchrun” spawn the extra processes, but this doesn’t feel like the best practice for DDP, and probably won’t work for multi-node runs.
I understand that “torchrun” probably can’t know how many times it was invoked, and therefore can’t set the correct “local_rank” argument (and possibly other arguments).
Is there a single best practice for the code / for launching “torchrun” that works for both single-node multi-GPU and multi-node runs?
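One pattern often suggested for this situation is to skip torchrun’s process spawning entirely: let `srun` start one task per GPU (as the allocation already does) and have the training script translate Slurm’s per-task environment variables into the ones `torch.distributed`’s `env://` rendezvous reads. The Slurm variable names below (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`) are standard, but the helper itself is only an illustrative sketch, and it assumes exactly one Slurm task per GPU:

```python
import os

def slurm_to_ddp_env(env=None):
    """Map Slurm's per-task environment onto the variable names that
    torch.distributed's env:// rendezvous expects (RANK, WORLD_SIZE,
    LOCAL_RANK). Assumes the job was requested as one task per GPU,
    e.g. --ntasks=X --gpus-per-task=1. Hypothetical helper, not part
    of any library."""
    env = os.environ if env is None else env
    rank = int(env["SLURM_PROCID"])         # global rank across all nodes
    world_size = int(env["SLURM_NTASKS"])   # total tasks == total GPUs
    local_rank = int(env["SLURM_LOCALID"])  # task index within this node
    env["RANK"] = str(rank)
    env["WORLD_SIZE"] = str(world_size)
    env["LOCAL_RANK"] = str(local_rank)
    return rank, world_size, local_rank
```

After this mapping, the script would still need `MASTER_ADDR`/`MASTER_PORT` set consistently on every task (typically exported in the sbatch script) before calling `torch.distributed.init_process_group("nccl")` and `torch.cuda.set_device(local_rank)`. The appeal of this approach is that the same code path works whether Slurm packed all X GPUs onto one node or spread them across several.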
Thank you for the answer.
Tried so many things; torchrun just doesn’t want to work in a multi-node manner, probably because InfiniBand isn’t correctly exposed when dealing with Slurm and containers. Also tried the MPI backend, which doesn’t work either. Tried native DDP without torchrun, but that also fails: the nodes can’t recognize each other and nothing starts. I think only Horovod works in some way, but I haven’t yet gotten around to checking how.
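When the nodes “can’t recognize each other,” a frequent cause is that `MASTER_ADDR` isn’t set to the same reachable host on every task. A common convention is to use the first node of the job, which on a real cluster is most robustly obtained via `scontrol show hostnames "$SLURM_NODELIST"`. As a rough illustration of what that resolution does, here is a minimal (and deliberately simplistic) parser for the compact nodelist format; it handles only the basic bracketed forms and is not a substitute for `scontrol`:

```python
import re

def first_host(nodelist):
    """Return the first hostname from a Slurm compact nodelist such as
    'node[01-04,07]' or 'gpu01,gpu02'. Illustrative sketch only: it
    covers a single 'prefix[ranges]' group or a plain comma-separated
    list, not the full Slurm nodelist grammar."""
    m = re.match(r"([^\[,]+)\[([^\]]+)\]", nodelist)
    if m:
        prefix, ranges = m.groups()
        # first range entry, lower bound of a 'lo-hi' span if present
        first = ranges.split(",")[0].split("-")[0]
        return prefix + first
    return nodelist.split(",")[0]
```

The resolved name would then be exported as `MASTER_ADDR` (with an agreed `MASTER_PORT`) before any process calls `init_process_group`, so all ranks rendezvous at the same address.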
This is so complicated I’m actually giving up. I’m so lost and no documentation is available anywhere.
I’m not familiar with torchx, I will try to read about it.
@assaf sorry that you tried many things and none of them worked out. We are trying to improve our setup tutorials for different environments, and one important tutorial we plan to add is the Slurm setup; this hasn’t been done yet. But there’s an existing repo that already covers a Slurm setup — could you check it and see if it’s helpful for you? disttraining/slurm at main · aivanou/disttraining · GitHub
Thanks @wanchaol, I sort of tried it, while also trying a simpler synthetic script, but running it with “torch.distributed.run” (which I think is deprecated?) instead of “torchrun” as in the script, and I’m getting networking or other issues. So I guess it doesn’t work at the moment.
I switched to Horovod, as it works perfectly, until an update is made.