NCCL error in torch._C._dist_broadcast(tensor, src, group) when training on two nodes

After finding the NCCL_SOCKET_IFNAME environment variable, here is what you have to do:

ifconfig
# check the ethernet interface name: e.g. eth0
NCCL_SOCKET_IFNAME=eth0 python your_script.py parameters
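
If you would rather set the variable from inside the script than on the command line, something like the following rough sketch might work. The helper names (pick_interface, set_nccl_interface) are made up for illustration, and skipping names that start with "lo" or "docker" is only a heuristic I am assuming here. On a machine with several NICs you should still check by hand which interface actually reaches the other node. The variable has to be set before the process group is initialized.

```python
import os
import socket


def pick_interface(names, exclude_prefixes=("lo", "docker")):
    """Return the first interface name not starting with an excluded prefix.

    Hypothetical helper: the exclusion list is an assumption, not an NCCL rule.
    """
    for name in names:
        if not name.startswith(exclude_prefixes):
            return name
    return None


def set_nccl_interface(ifname):
    """Export NCCL_SOCKET_IFNAME for this process (hypothetical helper).

    Call this before torch.distributed.init_process_group().
    """
    os.environ["NCCL_SOCKET_IFNAME"] = ifname


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs for the local interfaces.
    names = [name for _, name in socket.if_nameindex()]
    print("interfaces found:", names)
    chosen = pick_interface(names)
    if chosen is not None:
        set_nccl_interface(chosen)
        print("NCCL_SOCKET_IFNAME =", chosen)
```

This does the same thing as the one-liner above, just from Python, so the command-line version is still the simplest option if you already know the interface name.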

Best
