Unified multi-GPU and multi-node best practices?

Hi all,
What’s the best practice for running either single-node multi-GPU or multi-node multi-GPU training? In particular I’m using Slurm to allocate the resources, and while it is possible to select the number of nodes and the number of GPUs per node, I prefer to request a number of GPUs and let Slurm handle the allocation.

The thing is, there are two possible cases:

  1. Slurm allocated all of the GPUs on the same node.
  2. Slurm allocated the GPUs on multiple nodes.

It is important to mention that the allocation request is for X tasks (processes), and 1 GPU per task. So eventually there’ll be X tasks and X GPUs available.

I’ve noticed that using “torchrun” with the “–nproc_per_node” argument set to a number larger than 1 will create new processes (tasks), which are redundant and duplicated because the tasks were already allocated by Slurm, but without setting this argument to the correct number of tasks the training won’t start at all. That said, it seems that I should allocate only a single task for X GPUs and let “torchrun” create the extra processes, but this doesn’t seem like the best practice for DDP, and it probably won’t work for multi-node runs.

I understand that “torchrun” probably can’t really know how many times it was called, and therefore can’t set the correct “local_rank” argument (and other arguments?).

Is there a single best practice for the code / for starting “torchrun” that will work for both single-node multi-GPU runs and multi-node runs?

Thanks,
Assaf.

As a general guideline, allocating hosts instead of individual GPUs will make your life a lot easier. There are a lot of pitfalls in allocating at the GPU granularity.

For example, NCCL needs all GPUs of a host to be part of a collective in order to reliably use NVLink.

Most libraries assume a homogeneous cluster allocation when partitioning work. A common assumption is (world_size % local_size) == 0 and that local_size is constant across hosts (e.g., 8 processes split as 4 per host across 2 hosts).

On the specifics of your issue: torchrun won’t be able to figure out the right local rank since you had Slurm run multiple tasks on the same host.

I’m not sure what you’re referring to here as bad practice, but torchrun will do exactly what DDP wants, which is multiple processes, one per GPU, with LOCAL_RANK telling each process which GPU to use.

This will work in a multi-node run as ranks are assigned as part of the initial rendezvous.
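
To make that concrete, here’s a minimal sketch of a training script launched by torchrun; the model is just a placeholder, but the environment variables (LOCAL_RANK etc.) are the ones torchrun sets per process:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT per process.
    local_rank = int(os.environ["LOCAL_RANK"])

    # The default env:// rendezvous reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model; any nn.Module works the same way.
    model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()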

Have you considered using torchx? It has slurm support which should address your concerns.

Thank you for the answer.
I’ve tried so many things, and torchrun just doesn’t want to work in a multi-node manner, probably because InfiniBand isn’t exposed correctly when dealing with Slurm and containers. I also tried the MPI backend, which doesn’t work, and native DDP without torchrun, which doesn’t work either: the nodes can’t recognize each other and nothing starts. I think only Horovod works in some way, but I haven’t gotten around to checking how.

This is so complicated that I’m actually giving up. I’m lost, and no documentation is available anywhere.
I’m not familiar with torchx; I will try to read about it.

Thanks anyway.
Assaf.

@assaf sorry that you tried many things and none of them worked out. We are trying to improve our setup tutorials for different environments, and one important tutorial we plan to add is the Slurm setup; this hasn’t been done yet. But there’s an existing repo on the Slurm setup already, could you check it and see if it’s helpful for you? disttraining/slurm at main · aivanou/disttraining · GitHub

Thanks @wanchaol, I sort of tried it, while also trying simpler synthetic code, but running it with “torch.distributed.run” (which I think is deprecated?) instead of “torchrun” as in the script, I got some networking issues and other issues. So I guess it doesn’t work at the moment.

I switched to Horovod, as it works perfectly, until any updates are made.

Assaf.

This is old now but I have been doing similar things and struggling with inconsistent or unclear documentation around this. @kumpera is right in that allocating the number of hosts is much easier than allocating the number of tasks. If you can manage with single-node multi-GPU then definitely do this and avoid multi-node multi-GPU if possible, as it’s much harder to configure.

If you’re using Slurm you don’t need to have the WORLD_SIZE and RANK variables initialised; you can infer these values from Slurm environment variables like SLURM_NTASKS and SLURM_PROCID, as in the sketch below.
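
As a rough sketch (assuming one Slurm task per GPU, and that MASTER_ADDR/MASTER_PORT are exported elsewhere, e.g. in the sbatch script):

import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])        # global task index
world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"]) # task index within this node

# Passing rank/world_size explicitly means the RANK/WORLD_SIZE env vars aren't needed.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)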

For the multi-node setting, I initialised the distributed environment through rpc.init_rpc calls, passing global ranks retrieved from SLURM_PROCID (a rough sketch of that call follows the launch example below). You can then launch the script with Slurm using something like:

srun torchrun \
--nnodes ${nodes} \
--nproc_per_node ${procs} \
--rdzv_endpoint ${master_addr} \
--rdzv_backend c10d \
main.py

within a .sh or .slurm file by also specifying Slurm parameters such as

#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${procs}
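
And as a rough sketch of the rpc.init_rpc call mentioned above (the worker naming is just a convention; MASTER_ADDR/MASTER_PORT are assumed to be set by the launcher or exported in the script):

import os

from torch.distributed import rpc

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# The default rendezvous reads MASTER_ADDR/MASTER_PORT from the environment.
rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)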

This still might not be that clear, and I can post example scripts I’ve used. I have been trying to solve similar issues, and without much HPC/distributed programming background it’s been challenging; I think the docs could be clearer and more detailed, and the tutorials better explained.
