Hi there
I’m trying to run the following simple script on two machines:
import argparse
import time

import torch
import torch.distributed


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--global_rank", type=int)
    args = parser.parse_args()

    # Bind this process to its GPU before initializing NCCL.
    torch.cuda.set_device(args.local_rank)

    print("init")
    torch.distributed.init_process_group(
        backend="nccl",
        init_method="tcp://10.10.10.22:1191",
        world_size=2,
        rank=args.global_rank,
    )

    time.sleep(5)

    print("barrier")
    torch.distributed.barrier()  # HANGS HERE


if __name__ == "__main__":
    main()
I have two machines with the IPs 10.10.10.22 (master) and 10.10.10.25 with port 1191 open on both.
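To double-check reachability from the worker side, here is a quick socket probe (a minimal sketch; 10.10.10.22 and 1191 are just the values from above, and it only succeeds once something is listening on the master):

import socket

# Try to open a TCP connection from the worker to the master's store port.
with socket.create_connection(("10.10.10.22", 1191), timeout=5) as s:
    print("reached", s.getpeername())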
On the master (10.10.10.22):
export NCCL_SOCKET_IFNAME=eno1
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
python test.py --local_rank 0 --global_rank 0
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO Bootstrap : Using [0]eno1:10.10.10.22<0>
lambda-server4:1793990:1793990 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NET/Socket : Using [0]eno1:10.10.10.22<0>
lambda-server4:1793990:1793990 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
On the worker (10.10.10.25):
export NCCL_SOCKET_IFNAME=enp49s0f1
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
python test.py --local_rank 0 --global_rank 1
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO Bootstrap : Using [0]enp49s0f1:10.10.10.25<0>
hyperplane1:1255526:1255526 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NET/Socket : Using [0]enp49s0f1:10.10.10.25<0>
hyperplane1:1255526:1255526 [1] NCCL INFO Using network Socket
hyperplane1:1266304:1266392 [0] NCCL INFO Call to connect returned Connection timed out, retrying
hyperplane1:1266304:1266392 [0] NCCL INFO Call to connect returned Connection timed out, retrying
hyperplane1:1266304:1266392 [0] include/socket.h:403 NCCL WARN Connect to 10.10.10.22<49177> failed : Connection timed out
hyperplane1:1266304:1266392 [0] NCCL INFO bootstrap.cc:95 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO bootstrap.cc:309 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO init.cc:555 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO init.cc:840 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
The program hangs at the barrier call and I don't know why. Both machines get past the init_process_group call, so I assumed the connection between the two servers was fine, but at the barrier the worker's log shows the connect to 10.10.10.22<49177> timing out, and 49177 is not the port I opened.
Does anyone see the problem here? I have probably missed a configuration step, but I don't know which one.
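One thing I can still try to narrow this down is the same rendezvous with the gloo backend instead of nccl (a minimal sketch; only the backend argument changes, everything else stays as in the script above):

# If this barrier also hangs, the problem is general connectivity
# between the machines rather than something NCCL-specific.
torch.distributed.init_process_group(
    backend="gloo",
    init_method="tcp://10.10.10.22:1191",
    world_size=2,
    rank=args.global_rank,
)
torch.distributed.barrier()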
PyTorch 1.8.1
NCCL version 2.7.8+cuda11.1