Hi, I tried DistributedDataParallel with the NCCL backend.
It works when I use 1 or 2 nodes (each with 4 V100s).
However, errors happen when scaling further to 3 or 4 nodes, and there is always one node that fails with the following error (the other nodes report differently and look fine):
gpu45:169732:170179  transport/net_ib.cc:789 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 11155, vendor err 129
I tried PyTorch versions 1.2 (CUDA 10.0), 1.4 (CUDA 10.1), and 1.5 (CUDA 10.1).
The NCCL version is 2.4.8 in each case.
The NCCL_DEBUG info for the 4 nodes is listed below: Node0, Node1, Node2, [Node3](http://188.8.131.52:81/index.php/s/5B2wEFHSFCWSHfm).
When I run on 4 nodes, the `NCCL WARN NET/IB` error always happens on the third node (gpu45).
If I exclude that node and run on only the other three nodes, the `NCCL WARN NET/IB` error still appears.
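For reference, the debug logs above were collected with `NCCL_DEBUG=INFO`. Below is a sketch of the environment variables I am using or considering to narrow this down; the interface and HCA names are placeholders for my cluster, not values taken from the failing run:

```shell
# Verbose NCCL logging (this is how the logs linked above were produced)
export NCCL_DEBUG=INFO
# Optionally focus the logging on the network transport
export NCCL_DEBUG_SUBSYS=NET

# Placeholders: pin NCCL to a specific interface/HCA in case
# auto-detection picks the wrong device on some nodes
export NCCL_SOCKET_IFNAME=eth0   # placeholder interface name
export NCCL_IB_HCA=mlx5_0        # placeholder HCA name

# Fall back to TCP sockets, to test whether the failure
# is specific to the NET/IB transport
export NCCL_IB_DISABLE=1
```

If the run succeeds with `NCCL_IB_DISABLE=1`, that would at least point to the InfiniBand path (or the fabric on that node) rather than the DDP setup itself.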