I need some clarification on how torch.distributed handles creation of the default process group.
When I call _get_default_group(), a torch.distributed.ProcessGroupNCCL object is returned. From looking over the source code for DistributedDataParallel, this is what it calls internally when no process_group argument is supplied. However, torch.distributed.group.WORLD returns a different object, and it is not a ProcessGroupNCCL. If I pass the group.WORLD object into DistributedDataParallel as process_group, initialization hangs indefinitely (using the NCCL backend; I haven't tested others). If I instead pass the group from _get_default_group() into DistributedDataParallel, it works as expected, which makes sense since DistributedDataParallel would call that anyway when process_group is None.
Can anyone clarify the discrepancy between these two process groups? The docs on torch.distributed are light, so I'm trying to build a more complete understanding of this.