I started 4 processes on 2 GPUs (2 processes per GPU). However, when I try to use torch.distributed.all_gather_object(…) to collect the data, as follows:
torch.distributed.all_gather_object(all_process_list, [data])
I get the following error:
...
work = default_pg.allgather([tensor_list], [tensor])
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 46000
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 3 both on CUDA device 49000
all_gather(object_size_list, local_size, group=group)
...
It seems that torch.distributed.all_gather_object
does not support collecting data from multiple processes that share the same device. Does anyone know of a way to fix this? For reference, a minimal sketch of my setup is below.
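This is a simplified sketch, not my actual script; the device assignment via rank % 2 and the use of the RANK/WORLD_SIZE environment variables are assumptions about how the two ranks end up on the same GPU, but the all_gather_object call is the same one that fails:

```python
import os
import torch
import torch.distributed as dist

def run():
    # Launched as 4 processes (e.g. torchrun --nproc_per_node=4)
    # on a machine with only 2 GPUs, so two ranks share each device.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Assumed device mapping: two processes land on the same CUDA device.
    torch.cuda.set_device(rank % 2)

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    data = {"rank": rank}                 # arbitrary picklable object
    all_process_list = [None] * world_size

    # This is the call that raises the NCCL "Duplicate GPU detected" error.
    dist.all_gather_object(all_process_list, [data])

    dist.destroy_process_group()

if __name__ == "__main__":
    run()
```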