Gather data from multiple processes on one GPU

I started 4 processes on 2 GPUs (2 processes per GPU). However, when I try to use torch.distributed.all_gather_object(…) to collect the data as follows:

torch.distributed.all_gather_object(all_process_list, [data])

I receive the following error:

...
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 46000
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 3 both on CUDA device 49000
    all_gather(object_size_list, local_size, group=group)
...

It seems that torch.distributed.all_gather_object does not support collecting data from processes on the same device. Does anyone know if there is any way to fix this?
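For reference, here is a minimal sketch of my setup (the rank-to-device mapping, port, and the gathered object are placeholders, not my exact script):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # 4 ranks but only 2 GPUs, so ranks 0/2 share cuda:0 and ranks 1/3 share cuda:1
    torch.cuda.set_device(rank % 2)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    data = {"rank": rank}  # placeholder for the real data
    all_process_list = [None] * world_size
    dist.all_gather_object(all_process_list, [data])  # this call fails with "Duplicate GPU detected"

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)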

cc: @wconstab, would you please help?

NCCL does not support having multiple ranks on the same GPU.
You can use Gloo (a CPU-based communication backend) to gather the objects in CPU memory instead.

Method 1: create a separate ProcessGroup with Gloo support

cpu_pg = dist.new_group(backend="gloo")
dist.all_gather_object(..., group=cpu_pg)  # specify separate group
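For instance, a minimal sketch of Method 1 inside each worker, assuming a torchrun-style launch (the gathered object and the device mapping are placeholders):

import torch
import torch.distributed as dist

# assumes RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are set, e.g. by torchrun
dist.init_process_group("nccl")                       # default group for GPU tensor collectives
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

cpu_pg = dist.new_group(backend="gloo")               # extra group for CPU/object collectives

data = {"rank": dist.get_rank()}                      # any picklable object (placeholder)
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, data, group=cpu_pg)  # gathered in CPU memory on every rank

dist.destroy_process_group()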

Method 2: create a ProcessGroup with both NCCL and Gloo support

dist.init_process_group("cuda:nccl,cpu:gloo")
dist.all_gather_object(...)  # no need to specify separate group
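A sketch of Method 2 in context (the gathered object is again a placeholder); with both backends registered on one group, all_gather_object should go through Gloo on CPU without a separate group:

import torch.distributed as dist

# one group with both backends: NCCL for CUDA tensors, Gloo for CPU tensors/objects
dist.init_process_group("cuda:nccl,cpu:gloo")

data = {"rank": dist.get_rank()}        # any picklable object (placeholder)
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, data)  # no group argument needed

dist.destroy_process_group()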