NCCL error when using a multi-node configuration

I’m using detectron2, an object-detection library built on PyTorch, on an HPC machine with 4 GPUs per node. Everything works fine on a single node, but the run fails as soon as I launch on multiple nodes, for example 2 nodes with 8 GPUs in total (4 GPUs per node).
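
For context, this is roughly how I start the training, via detectron2's `launch` utility (a sketch of my setup; the rendezvous address/port and the empty `main` are placeholders):

```python
from detectron2.engine import launch

def main():
    # actual detectron2 training code goes here
    ...

if __name__ == "__main__":
    launch(
        main,
        num_gpus_per_machine=4,   # 4 GPUs on each node
        num_machines=2,           # 2 nodes -> 8 processes in total
        machine_rank=0,           # 0 on the first node, 1 on the second
        dist_url="tcp://10.128.31.149:29500",  # placeholder rendezvous address
    )
```

With that setup I get: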

File "*****/view/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier work = default_pg.barrier(opts=opts) RuntimeError: NCCL error in: ***/spack-src/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3

From what I found in the PyTorch source code, that line raises the following error:

"Tensor list mustn't be larger than the number of available GPUs"

The following is the output with `NCCL_DEBUG=INFO` enabled:

```
lrdn1629:236813:236813 [1] NCCL INFO cudaDriverVersion 11080
lrdn1629:236813:236813 [1] NCCL INFO Bootstrap : Using ib0:10.128.31.149<0>
lrdn1629:236813:236813 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lrdn1629:236813:236852 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB ib0:10.128.31.149<0>
lrdn1629:236813:236852 [1] NCCL INFO Using network IB
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] misc/socket.cc:456 NCCL WARN Net : Connect to 10.128.31.153<46829> failed : Connection refused
lrdn1629:236813:236852 [1] NCCL INFO bootstrap.cc:256 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:516 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:1089 -> 6
lrdn1629:236813:236852 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
lrdn1629:236813:236813 [1] NCCL INFO group.cc:421 -> 3
lrdn1629:236813:236813 [1] NCCL INFO group.cc:106 -> 3
lrdn1629:236813:236813 [1] NCCL INFO comm 0x3843eb10 rank 1 nranks 8 cudaDev 1 busId 56000 - Abort COMPLETE
```
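
Since the bootstrap goes over `ib0` and the connection to the other node (10.128.31.153) is refused, one thing I plan to try is pinning NCCL's socket traffic to that interface explicitly and turning up init/net logging before the process group is created (a sketch; `ib0` is taken from the log above):

```python
import os

# Pin NCCL's bootstrap/socket traffic to the IB interface from the log
# and enable verbose init/net logging. Both must be set before the NCCL
# communicator is created.
os.environ["NCCL_SOCKET_IFNAME"] = "ib0"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```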

When I run with 2 nodes and 2 GPUs per node, the job just hangs: no error, but no progress either. The same code works fine on other HPC machines, including multi-node runs.
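
To make the stuck case fail loudly instead of hanging, I can also shorten the process-group timeout (a sketch; async error handling has to be enabled for NCCL for the timeout to actually fire):

```python
import datetime
import os

import torch.distributed as dist

# With async error handling on, a collective that never completes
# raises after the timeout instead of blocking forever. Must be set
# before the process group is created.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=2),
)
```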

I already ran nccl-tests and it passed without issues. Any suggestions? Thanks.