DDP: diff between dist.all_gather and dist.all_gather_multigpu?

I'm not sure about the exact difference.
Both are used to sync a tensor across processes: the output of both methods is a list of tensors, where each tensor comes from one process.
You can then merge that list however you like to get the final synced tensor, as in here.
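For concreteness, here is a minimal sketch of that pattern, assuming a gloo backend and a launch with torchrun (the function name `gather_example` and the toy tensor are just for illustration):

```python
import torch
import torch.distributed as dist

def gather_example():
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each process contributes its own tensor (here just its rank).
    local = torch.tensor([float(rank)])

    # Pre-allocate one slot per process; all_gather fills them in rank order.
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # Every process now holds the same list; merge it however you need,
    # e.g. concatenate into a single synced tensor.
    return torch.cat(gathered)

if __name__ == "__main__":
    # Assumes a launch like: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group("gloo")
    print(gather_example())
    dist.destroy_process_group()
```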
Broadcasting is done with dist.broadcast (torch.distributed.broadcast), I think: it copies a tensor from a source rank and sends it out to all the other processes. all_gather, on the other hand, copies the tensors from all processes into a list and makes sure that every process ends up with exactly the same list.
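For contrast, a minimal broadcast sketch under the same assumptions (gloo backend, launched with torchrun; the value 42.0 and the function name are made up for the example):

```python
import torch
import torch.distributed as dist

def broadcast_example():
    rank = dist.get_rank()

    # Only the source rank holds the "real" value; the others just allocate space.
    t = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)

    # Copies the tensor from src=0 into t on every other process, in place.
    dist.broadcast(t, src=0)
    return t  # identical on all ranks after the call

if __name__ == "__main__":
    dist.init_process_group("gloo")  # e.g. torchrun --nproc_per_node=2 this_script.py
    print(broadcast_example())
    dist.destroy_process_group()
```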