Dist.all_reduce_multigpu NCCL error messages

Hey all,
I am training a model with Distributed Data Parallel.
I have two separate machines, with 2 GPUs on each machine.
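For context, this is roughly how I initialize the process group on each node (just a minimal sketch; the master address, port, and rank values below stand in for my actual launch setup):

        import os
        import torch.distributed as dist

        # Assumed layout: one process per machine, each process driving both local GPUs.
        # MASTER_ADDR / MASTER_PORT / RANK values are placeholders for my real launch config.
        os.environ.setdefault("MASTER_ADDR", "10.0.0.1")   # IP of the first machine
        os.environ.setdefault("MASTER_PORT", "29500")

        dist.init_process_group(
            backend="nccl",
            rank=int(os.environ["RANK"]),   # 0 on machine A, 1 on machine B
            world_size=2,                   # two nodes, one process each
        )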
Then I tried the code from the documentation, like this:

        tensor_list = []
        for dev_idx in range(torch.cuda.device_count()):
            # one tensor per local GPU; all_reduce_multigpu expects each tensor
            # in the list to reside on a different device
            tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

        dist.all_reduce_multigpu(tensor_list)

But I got this error message:

        RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
        ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Can anyone help me out with this?