What is the difference between dist.all_reduce_multigpu and dist.all_reduce

In a single-node multi-GPU setup I have used dist.all_reduce. Will it also work for a multi-node multi-GPU setup, or do I have to use dist.all_reduce_multigpu?
In general, what's the difference between the two?

Thanks for posting the question @Rakshith_V. The multigpu versions of the collectives are meant for the case where one rank manages multiple GPUs. For example, if you have 2 nodes with 8 GPUs each but launch only world_size=2 processes, so that each process manages 8 GPUs, you would need these collectives to run the operation across all of the GPUs.
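A minimal sketch of that case, assuming 2 nodes with 8 GPUs each, one process per node (world_size=2), and MASTER_ADDR/MASTER_PORT already set in the environment. The tensor shape and the SUM op are just illustrative:

```python
import torch
import torch.distributed as dist

# One process per node, world_size=2; each process drives 8 local GPUs.
dist.init_process_group(backend="nccl", init_method="env://")

# One tensor per local GPU. all_reduce_multigpu reduces across every
# tensor in every list on every rank, i.e. 2 ranks * 8 GPUs = 16 tensors.
tensors = [torch.ones(10, device=f"cuda:{i}") for i in range(8)]
dist.all_reduce_multigpu(tensors, op=dist.ReduceOp.SUM)

# After the call, each tensor on each GPU holds the sum over all 16 tensors.
```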

But if you launch world_size=16 processes (one process per GPU), you can simply use all_reduce instead of all_reduce_multigpu.
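For comparison, a sketch of the one-process-per-GPU case, assuming the 16 processes are launched with something like torchrun --nnodes=2 --nproc_per_node=8 so that LOCAL_RANK is set for each process:

```python
import os
import torch
import torch.distributed as dist

# 16 processes in total (world_size=16), one per GPU.
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher, e.g. torchrun
torch.cuda.set_device(local_rank)

# Each process owns exactly one tensor on its own GPU.
tensor = torch.ones(10, device=f"cuda:{local_rank}")
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
# tensor now holds the sum contributed by all 16 processes.
```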

See the docs here: Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation
