In a single-node multi-GPU setup I have used dist.all_reduce. Will it also work for a multi-node multi-GPU setup, or do I have to use dist.all_reduce_multigpu?
In general, what's the difference between the two?
Thanks for posting the question @Rakshith_V. The multigpu versions of the collectives are meant for the case where one rank manages multiple GPUs. For example, if you have 2 nodes, each with 8 GPUs, but you launch a job with world_size=2 so that each process manages 8 GPUs, then you need these collectives for the operations.
But if you launch world_size=16 processes (one per GPU), you can simply use all_reduce instead of all_reduce_multigpu.
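Here is a minimal sketch of both launch styles, assuming the NCCL backend, a torchrun-style launch that sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK, and 8 GPUs per node; the function names are just for illustration:

```python
import os
import torch
import torch.distributed as dist


def per_gpu_process_allreduce():
    # world_size = 16: one process per GPU across the 2 nodes.
    # Each process owns exactly one GPU, so the plain collective is enough.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    tensor = torch.ones(4, device=f"cuda:{local_rank}") * dist.get_rank()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()


def per_node_process_allreduce_multigpu(gpus_per_node=8):
    # world_size = 2: one process per node, each managing all 8 local GPUs.
    dist.init_process_group(backend="nccl")

    # One tensor per local GPU; the multigpu variant reduces across all
    # tensors in the list on every rank, leaving the result on each GPU.
    tensors = [
        torch.ones(4, device=f"cuda:{i}") * dist.get_rank()
        for i in range(gpus_per_node)
    ]
    dist.all_reduce_multigpu(tensors, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()
```

The one-process-per-GPU layout is the more common pattern (it is what torchrun and DistributedDataParallel assume), so plain all_reduce covers most multi-node jobs.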
See the doc here: Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation