I was trying to launch a distributed training job that requires a ProcessGroupNCCL, but it seems that pybind11 is failing to convert my process group to NCCL.
The process is initialized as follows:
import os

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
dist.init_distributed(dist_backend="nccl", rank=rank, world_size=world_size)
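For reference, the environment lookup above can be written so that it fails with a clearer message when the script is not started through Open MPI. This is only a sketch: `read_ompi_env` is a hypothetical helper name, not part of any library, and it reads the same three variables used above.

```python
import os


def read_ompi_env(env=None):
    """Read rank, local rank and world size from Open MPI's environment."""
    env = os.environ if env is None else env
    names = {
        "rank": "OMPI_COMM_WORLD_RANK",
        "local_rank": "OMPI_COMM_WORLD_LOCAL_RANK",
        "world_size": "OMPI_COMM_WORLD_SIZE",
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        # Usually means the script was not launched through mpirun/mpiexec.
        raise RuntimeError("missing Open MPI variables: " + ", ".join(missing))
    # Convert every value to int, keyed by the short name.
    return {key: int(env[var]) for key, var in names.items()}
```

The returned dictionary can then supply `rank` and `world_size` to the `init_distributed()` call, which keeps a bad launch from surfacing as a bare `KeyError`.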
And the error I get is the following:
TypeError: ensure_nccl(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.distributed.distributed_c10d.ProcessGroupNCCL, arg1: torch.Tensor) -> None
Invoked with: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fa630e71cb0>, tensor([[ 0.8793, 1.1869, -1.2402, ..., 0.7247, -0.2372, -0.3371],
Looking at the source code of init_distributed(), when I pass "nccl" as the backend the process group should be created as torch.distributed.distributed_c10d.ProcessGroupNCCL, but it is not.
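One way to check what the process group actually is at runtime is to print its class, as sketched below. This assumes that `init_distributed()` ends up creating the default `torch.distributed` process group, and it relies on two private PyTorch members (`_get_default_group` and `_get_backend`) that exist in recent PyTorch 2.x releases but are not part of the public API. `describe` and `report_default_group` are hypothetical helper names.

```python
def describe(obj):
    """Return the fully qualified class name of obj."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def report_default_group():
    """Print the type of the default group and, if reachable, of its backend.

    Call this only after the process group has been initialized.
    """
    # Imported lazily so describe() stays usable without PyTorch installed.
    import torch
    import torch.distributed as torch_dist

    # Private helper: returns the default (world) process group.
    pg = torch_dist.distributed_c10d._get_default_group()
    print("backend name:", torch_dist.get_backend())
    print("group type:  ", describe(pg))

    # Assumption: on PyTorch 2.x the group may be a generic ProcessGroup
    # object, with the backend-specific object behind a private accessor.
    get_backend = getattr(pg, "_get_backend", None)
    if get_backend is not None:
        print("backend type:", describe(get_backend(torch.device("cuda"))))
```

Calling `report_default_group()` right after `init_distributed()` on each rank would show whether the object passed to `ensure_nccl()` is a plain `ProcessGroup` or a `ProcessGroupNCCL`.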