Hi,
I’m currently using torchrun to do distributed training.
I call `init_process_group()` without any parameters in my code, since it can automatically pick the backend based on the device type of my environment (see Distributed communication package - torch.distributed — PyTorch 2.7 documentation).
My environment has GPUs, and the NCCL logs produced via the NCCL_DEBUG env var show that NCCL is indeed being used.
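Roughly, the setup looks like this (a minimal sketch, not my actual training script; the script name and the two-GPU launch command are just placeholders):

```python
# Launched with e.g.: torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist

def main():
    # No backend argument: torch is supposed to pick it from the device type.
    dist.init_process_group()

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Force a collective so the lazily created NCCL backend is actually used.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)

    # I would expect 'nccl' here, but I get 'undefined'.
    print(f"rank {dist.get_rank()}: backend = {dist.get_backend()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```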
Problem: However, `torch.distributed.get_backend()` still returns `'undefined'`, even after lazy initialization has completed and NCCL is clearly being used. I also tried passing the default process group explicitly as its argument, but it returns the same value. It only returns `nccl` when I explicitly call `init_process_group('nccl')`.
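For reference, this is roughly what I tried (a sketch; `dist.group.WORLD` is how I referred to the default process group):

```python
import torch.distributed as dist

# After dist.init_process_group() with no arguments and a collective on a CUDA tensor:
print(dist.get_backend())                  # prints 'undefined'
print(dist.get_backend(dist.group.WORLD))  # still 'undefined'

# Only with an explicit backend does it report what I expect:
# dist.init_process_group('nccl')
# dist.get_backend()                       # 'nccl'
```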
Question: Is this expected behavior? How can I detect which backend torch is using when I call `init_process_group()` with no parameters?