ProcessGroup not converted to ProcessGroupNCCL

I was trying to launch a distributed training run that requires a ProcessGroupNCCL, but it seems like pybind11 is failing to convert my process group to the NCCL type.

The process is initialized as follows:

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

dist.init_distributed(dist_backend="nccl", rank=rank, world_size=world_size)

And the error I get is the following:

TypeError: ensure_nccl(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.distributed.distributed_c10d.ProcessGroupNCCL, arg1: torch.Tensor) -> None

Invoked with: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fa630e71cb0>, tensor([[ 0.8793,  1.1869, -1.2402,  ...,  0.7247, -0.2372, -0.3371],

Looking at the source code of init_distributed(), when I pass “nccl” as the backend the process group should be defined as torch.distributed.distributed_c10d.ProcessGroupNCCL, but it is not.

init_distributed looks like a different API, so I assume it is a wrapper around torch.distributed.init_process_group. Since PyTorch 2.0 (see the “Distributed (c10d)” section of the PyTorch 2.0 release notes on GitHub), init_process_group no longer returns backend-specific process group instances (e.g. ProcessGroupNCCL, ProcessGroupGloo) but instead returns a generic ProcessGroup instance, which may wrap multiple backends. The fix is probably just to update the ensure_nccl function to check for torch.distributed.distributed_c10d.ProcessGroup instead of torch.distributed.distributed_c10d.ProcessGroupNCCL.
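To make that change concrete, here is a minimal toy model in plain Python (no torch; all class names are made up for illustration) of the PyTorch 2.0 design: the returned group is a generic dispatcher holding per-device backends, not an NCCL subclass itself, which is exactly why an isinstance-style check like the one in ensure_nccl rejects it:

```python
# Toy model (plain Python, no torch) of the PyTorch 2.0 change:
# the returned ProcessGroup is a generic dispatcher that holds
# per-device backends, not a subclass of the NCCL backend itself.
class Backend:
    pass

class NCCLBackend(Backend):
    pass

class ProcessGroup:
    def __init__(self):
        self._backends = {}

    def _register_backend(self, device, backend):
        self._backends[device] = backend

    def _get_backend(self, device):
        return self._backends[device]

pg = ProcessGroup()
pg._register_backend("cuda", NCCLBackend())

# A check against the NCCL type fails on the generic group...
assert not isinstance(pg, NCCLBackend)
# ...but the NCCL backend is still reachable per device.
assert isinstance(pg._get_backend("cuda"), NCCLBackend)
```

So a binding that needs NCCL specifically can either accept the generic ProcessGroup, or look up the per-device backend from it rather than requiring the NCCL type at the top level.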

Actually, I typed it wrong; it is just:

import torch.distributed as dist

But I think your observation is still valid; I’ll check and see what happens.