I am training my model using Distributed Data Parallel (DDP).
I have two separate machines, each with 2 GPUs.
I've tried the example from the documentation, like this:

```python
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))
dist.all_reduce_multigpu(tensor_list)
```
But I got this error:
```
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
```
Can anyone help me out with this?
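For reference, here is the plain per-process `dist.all_reduce` pattern I was expecting to use instead. This is a minimal sketch under the assumption of one process per GPU (the usual DDP setup, e.g. launched with `torchrun`); the single-process group and CPU/`gloo` fallback below are just so it can run standalone for illustration, not part of a real multi-machine job:

```python
import os
import torch
import torch.distributed as dist

def run_all_reduce():
    # Illustrative single-process group. In a real DDP job, torchrun sets
    # RANK/WORLD_SIZE/MASTER_ADDR, and each process owns exactly one GPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=0, world_size=1)

    device = torch.device("cuda", 0) if torch.cuda.is_available() else "cpu"
    # One tensor per process -- no list of per-GPU tensors as in the
    # multigpu variant above.
    t = torch.ones(3, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums the tensor across all ranks

    result = t.cpu()
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    print(run_all_reduce())
```

With `world_size=1` the all-reduce is a no-op, but with one process per GPU across both machines the same call would sum `t` across all 4 ranks.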