Multi-node distributed training: RuntimeError: NCCL error in /torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

Hi, I am new to PyTorch, and I am trying to deploy a distributed training task across 2 nodes with 4 GPUs each. I have followed the comments in the torch.distributed.launch source code, but I am still confused.

Node 1 script

CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed

Node 2 script

CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed

And I always hit the following error on Node 2:

Traceback (most recent call last):
  File "main.py", line 33, in <module>
    trainer.train()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 165, in train
    self.setup_network()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 90, in setup_network
    broadcast_buffers=False,)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 134, in __init__
    self.broadcast_bucket_size)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 251, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/usr/local/lib/python2.7/dist-packages/torch/distributed/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /export/home/v-yehl/code/caffe2/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

The main.py script runs correctly on a single node.
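
For reference, main.py follows the usual env:// setup that torch.distributed.launch expects. Here is a simplified sketch of it (build_model is a placeholder for the real network construction, and the trainer details are omitted):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument('--local_rank', type=int, default=0)
parser.add_argument('--folder', type=str)
args = parser.parse_args()

# Pin this process to its GPU before creating the process group
torch.cuda.set_device(args.local_rank)

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are exported by the
# launcher, so the env:// init method reads everything it needs
dist.init_process_group(backend='nccl', init_method='env://')

model = build_model().cuda()  # build_model is a placeholder
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    broadcast_buffers=False,
)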

Thanks in advance.

I found the solution.

If you are running inside nvidia-docker, you need to add the --network=host flag to the docker run command so that the container uses the same IP address as the host.
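
For example, on each node (my_pytorch_image and launch_node.sh are placeholders for your own image and training script):

nvidia-docker run --network=host -it my_pytorch_image \
    bash launch_node.sh

With --network=host the container shares the host's network stack, so the master_addr above is reachable from inside the container exactly as it is from the host.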

The NCCL error you posted doesn't convey any information that can help, unfortunately. Take a look at https://pytorch.org/docs/stable/distributed.html#other-nccl-environment-variables for some environment variables you can set that may help you debug this issue.
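
For example, running the launch command with NCCL's debug variables set will make NCCL log what it is doing during initialization (eth0 is a placeholder; point NCCL_SOCKET_IFNAME at the interface that actually carries 11.7.157.133):

NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 \
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed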