Clarification on Distributed's default process group

I may need some slight clarification on the way torch.distributed handles creation of the default process group.

Calling _get_default_group() returns a torch.distributed.ProcessGroupNCCL object. From looking over the source code for DistributedDataParallel, this is what it calls internally when no process_group argument is supplied. However, accessing torch.distributed.group.WORLD returns a different group object, and it isn't a ProcessGroupNCCL either. If I pass the group from group.WORLD into DistributedDataParallel as the process_group, initialization hangs indefinitely (using the NCCL backend; I haven't tested others). If I pass the _get_default_group() group into DistributedDataParallel, it works as expected (which makes sense, since DistributedDataParallel would call it by default anyway when the process_group argument is None).
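Roughly what I'm doing, as a minimal sketch (this assumes one process per GPU launched with the usual env:// setup, and that _get_default_group lives under torch.distributed.distributed_c10d, which is an internal detail that may move between versions):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with one process per GPU, so RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT are set in the environment.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Private helper; its location is an internal detail.
default_pg = dist.distributed_c10d._get_default_group()

print(type(default_pg))        # ProcessGroupNCCL, per the behavior described above
print(type(dist.group.WORLD))  # a different object, not ProcessGroupNCCL

model = torch.nn.Linear(8, 8).cuda()

# Works (equivalent to leaving process_group=None):
ddp = DDP(model, device_ids=[torch.cuda.current_device()],
          process_group=default_pg)

# Hangs indefinitely during construction in my setup:
# ddp = DDP(model, device_ids=[torch.cuda.current_device()],
#           process_group=dist.group.WORLD)
```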

Can anyone clarify the discrepancy between these two process groups? The docs are light regarding torch.distributed so I’m trying to get a more complete understanding of this.

The difference is that torch.distributed.group.WORLD is a constant that identifies the global process group. All functions prefixed with an underscore are not part of the public API and should not be used if you expect version-to-version compatibility.

For now you can continue to specify process_group=None to make it pick up the default process group. I created https://github.com/pytorch/pytorch/issues/17305 to track fixing the issue you mention. Passing group.WORLD should be handled transparently, just like it is by the functions in the torch.distributed module.
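To make the workaround concrete, a minimal sketch (assuming the usual env:// initialization and one process per GPU; the model here is just a placeholder, and exactly how group.WORLD is resolved internally is an implementation detail):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(8, 8).cuda()

# Recommended for now: leave process_group unset (None) and let
# DistributedDataParallel look up the default process group itself.
ddp = DDP(model, device_ids=[torch.cuda.current_device()])

# The public collectives accept the group.WORLD constant and resolve it
# to the default process group for you, e.g.:
t = torch.ones(1).cuda()
dist.all_reduce(t, group=dist.group.WORLD)
```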


Thank you for the reply! That helps!