Hi all,
What’s the best practice for running a single-node multi-GPU or a multi-node multi-GPU job? In particular, I’m using Slurm to allocate the resources, and while it is possible to select the number of nodes and the number of GPUs per node explicitly, I prefer to request only the total number of GPUs and let Slurm handle the placement.
The thing is, there are two possible cases:
- Slurm allocates all of the GPUs on the same node.
- Slurm spreads the GPUs across multiple nodes.
It is important to mention that the allocation request is for X tasks (processes), with 1 GPU per task, so eventually there will be X tasks and X GPUs available.
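For concreteness, the request looks roughly like this (a sketch only; the script name is a placeholder and X=4 is an example):

```shell
#!/bin/bash
# Request X tasks with 1 GPU each; Slurm decides how the
# GPUs end up distributed across nodes.
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1

srun python train.py
```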
I’ve noticed that launching `torchrun` with `--nproc_per_node` set to a number larger than 1 makes it spawn new worker processes, which are redundant duplicates because the tasks were already launched by Slurm; yet without setting this argument to the correct number of tasks, the training won’t start at all. Given that, it seems I should allocate only a single task for the X GPUs and let `torchrun` create the extra processes itself, but that doesn’t look like best practice for DDP, and it probably won’t work for multi-node runs.
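To illustrate the duplication I mean (again with X=4 and placeholder names):

```shell
# Case A: Slurm launches X=4 tasks, each of which starts torchrun,
# and each torchrun instance spawns 4 workers -> 16 processes total.
srun torchrun --nproc_per_node=4 train.py

# Case B: a single Slurm task; torchrun alone spawns the 4 workers.
# The process count is right on one node, but this bypasses Slurm's
# per-task GPU binding and doesn't generalize to multiple nodes.
torchrun --standalone --nproc_per_node=4 train.py
```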
I understand that `torchrun` probably can’t really know how many times it was invoked, and therefore can’t set the correct `local_rank` (and other arguments?) on its own.
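For reference, this is the mapping I mean between the environment variables `torchrun` exports to its workers and the ones Slurm exports to its tasks (a sketch of a fallback, not something I’m claiming is best practice):

```python
import os

def dist_env():
    # torchrun exports LOCAL_RANK / RANK / WORLD_SIZE to each worker
    # it spawns; Slurm exports SLURM_LOCALID / SLURM_PROCID / SLURM_NTASKS
    # to each task it launches. Fall back to the Slurm variables when
    # this process was not started by torchrun.
    local_rank = int(os.environ.get("LOCAL_RANK",
                                    os.environ.get("SLURM_LOCALID", "0")))
    rank = int(os.environ.get("RANK",
                              os.environ.get("SLURM_PROCID", "0")))
    world_size = int(os.environ.get("WORLD_SIZE",
                                    os.environ.get("SLURM_NTASKS", "1")))
    return local_rank, rank, world_size

print(dist_env())  # e.g. (0, 0, 1) when run outside both launchers
```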
Is there a single best practice, for the code and for how `torchrun` is started, that works for both single-node multi-GPU and multi-node runs?
Thanks,
Assaf.