DDP error when scaling beyond 2 nodes

Hi, I'm trying DistributedDataParallel with the NCCL backend.
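For context, here is a minimal sketch of the kind of setup involved. The script details are illustrative (one process per GPU, assuming the launcher exports `LOCAL_RANK`, e.g. `torch.distributed.launch --use_env`), not my exact training code:

```python
# Minimal multi-node DDP sketch: one process per GPU on every node,
# rendezvous via environment variables (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...

if __name__ == "__main__":
    main()
```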

It works when I use 1 or 2 nodes (each with 4 V100 GPUs).

However, an error occurs when scaling further to 3 or 4 nodes: one node always fails with the error below, while the other nodes report different output that looks correct.

gpu45:169732:170179 [0] transport/net_ib.cc:789 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 11155, vendor err 129

I tried PyTorch versions 1.2 (CUDA 10.0), 1.4 (CUDA 10.1), and 1.5 (CUDA 10.1).
The NCCL version is 2.4.8.
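For anyone checking the same thing, a quick way to confirm the versions PyTorch itself reports (the example values in the comments are assumptions based on my builds):

```python
# Print the PyTorch, CUDA, and bundled NCCL versions.
import torch

print(torch.__version__)          # e.g. '1.5.0'
print(torch.version.cuda)         # e.g. '10.1'
print(torch.cuda.nccl.version())  # e.g. 2408 for NCCL 2.4.8 on these builds
```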

The NCCL_DEBUG output for the 4 nodes is linked below: Node0, Node1, Node2, [Node3](http://49.234.107.127:81/index.php/s/5B2wEFHSFCWSHfm).
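For anyone reproducing this, one way to collect the same debug output is to set the NCCL debug environment variables before `init_process_group` (the equivalent shell `export` works too):

```python
# Enable verbose NCCL logging before initializing the process group.
import os

os.environ["NCCL_DEBUG"] = "INFO"        # print transport/topology selection
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: include all subsystems
```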

When I run on 4 nodes, the NCCL WARN NET/IB error always occurs on the third node.
If I exclude that node (gpu45) and run on only the other three nodes, the NCCL WARN NET/IB error still occurs.

This might be related to this issue in the NCCL repo, and this comment seems to fix it.

Thanks a lot!
For now I've given up on NCCL; the problem seems to be related to the system configuration.
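In case it helps others, here is a rough sketch of the two interim workarounds I'm considering. These are assumptions for debugging, not a confirmed fix: either keep NCCL but disable its InfiniBand transport so it falls back to TCP sockets, or switch the backend to Gloo entirely.

```python
# Interim workarounds while the IB issue is unresolved.
import os
import torch.distributed as dist

# Option 1: keep NCCL, but route traffic over sockets instead of InfiniBand.
os.environ["NCCL_IB_DISABLE"] = "1"
dist.init_process_group(backend="nccl", init_method="env://")

# Option 2: avoid NCCL altogether (slower collectives, but no IB transport).
# dist.init_process_group(backend="gloo", init_method="env://")
```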