Unified multi-GPU and multi-node best practices?

As a general guideline, allocating whole hosts instead of individual GPUs will make your life a lot easier. There are a lot of pitfalls when allocating at the granularity of individual GPUs.
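For illustration, here's roughly what host-level allocation can look like if you submit jobs from Python with submitit (the package, the partition name and the resource numbers are just assumptions about your setup, not something from your post); the point is one launcher task per host that owns all of that host's GPUs:

```python
# Sketch only: host-level allocation submitted through submitit (assumed to be
# installed). The rough sbatch equivalent is --nodes=2 --ntasks-per-node=1
# plus a full node's worth of GPUs, i.e. you own whole hosts.
import submitit

def launch_training():
    # Placeholder: from this single per-host task you would start one worker
    # per local GPU, e.g. by invoking torchrun as discussed further down.
    pass

executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    nodes=2,                # whole hosts, not individual GPUs
    tasks_per_node=1,       # one launcher task per host
    gpus_per_node=8,        # every GPU on each host
    cpus_per_task=32,       # assumption; size to your machines
    timeout_min=60,
    slurm_partition="gpu",  # assumption; your partition name
)
job = executor.submit(launch_training)
print(job.job_id)
```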

For example, NCCL needs all GPUs of a host to be part of a collective in order to reliably use NVLink.

Most libraries assume a homogeneous cluster allocation when partitioning work. A common assumption is that (world_size % local_size) == 0 and that local_size is constant across hosts.
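If you want to fail fast when one of those assumptions is broken, a small startup check along these lines helps (a sketch that relies on the WORLD_SIZE and LOCAL_WORLD_SIZE environment variables torchrun sets for every worker):

```python
# Sanity-check the homogeneity assumptions before doing any collective work.
import os
import torch

world_size = int(os.environ["WORLD_SIZE"])              # total number of workers
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])  # workers on this host

# With a homogeneous allocation, world_size is a whole multiple of the
# per-host worker count.
assert world_size % local_world_size == 0, (
    f"world_size={world_size} is not a multiple of "
    f"local_world_size={local_world_size}"
)
# And each worker is expected to own exactly one of the host's GPUs.
assert local_world_size == torch.cuda.device_count(), (
    f"{local_world_size} workers on this host but "
    f"{torch.cuda.device_count()} visible GPUs"
)
```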

On the specifics of your issue: torchrun won’t be able to figure out the right local rank, since you had SLURM run multiple tasks on the same host.

I’m not sure what you’re referring to here as a bad practice, but torchrun does exactly what DDP wants: multiple processes, one per GPU, with LOCAL_RANK telling each process which GPU to use.
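Concretely, a per-process setup that follows that contract looks roughly like this (a sketch; the script name and the model are placeholders):

```python
# train.py -- launched by torchrun, which starts one process per GPU and sets
# LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each of them.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Bind this process to its GPU before creating the process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; wrap it in DDP on this process's GPU.
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With host-level allocation, each node then runs a single `torchrun --nproc_per_node=<GPUs per host> train.py` task rather than one SLURM task per GPU.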

This works in a multi-node run as well, since ranks are assigned as part of the initial rendezvous.
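In other words, you launch the same command on every node, something along the lines of `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<node 0 hostname>:29500 train.py` (node count, GPU count and port are placeholders), and the rendezvous hands every process a unique global rank. A quick way to see that assignment:

```python
# Minimal sketch: after init_process_group, each process has a globally unique
# rank that was assigned during the rendezvous, whichever node it landed on.
import os
import socket
import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl")
print(
    f"host={socket.gethostname()} "
    f"local_rank={os.environ['LOCAL_RANK']} "
    f"global_rank={dist.get_rank()}/{dist.get_world_size()}"
)
dist.destroy_process_group()
```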

Have you considered using TorchX? It has SLURM support, which should address your concerns.