ericauld
(Eric Auld)
1
Hi, newbie here. I have two EC2 instances that won’t connect using Gloo.
Things I’ve tried:
- The script works when I run it on two shells from the same machine (the rank 0 machine)
- I can create a simple connection between the two instances with netcat
- I’ve set GLOO_SOCKET_IFNAME to ‘ens5’ after consulting ifconfig(1)
- I’ve confirmed that both the rank 0 and rank 1 machines are stuck on init_process_group
Here’s the code: distributed_training.py · GitHub
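For readers without access to the linked script, here is a minimal sketch of the kind of setup described above. It is not the code from the gist. The helper names (tcp_init_method, init_gloo), the 60-second timeout, and the default port 23456 (taken from the log lines later in this thread) are assumptions made for illustration. Passing a finite timeout is meant to turn an indefinite hang in init_process_group into an error that can be inspected, though whether it fires will depend on where the hang occurs.

```python
import os
from datetime import timedelta


def tcp_init_method(master_addr, port=23456):
    """Build a tcp:// rendezvous URL for the rank 0 host (port is an assumption)."""
    return f"tcp://{master_addr}:{port}"


def init_gloo(rank, world_size, master_addr, ifname="ens5", timeout_s=60):
    """Hypothetical helper: pin the Gloo interface, then join the process group."""
    # Must be set before the process group is created.
    os.environ["GLOO_SOCKET_IFNAME"] = ifname

    # Imported lazily so the URL helper above can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method=tcp_init_method(master_addr),
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=timeout_s),
    )
    return dist
```

Each instance would call init_gloo with its own rank (0 or 1), world_size=2, and the private address of the rank 0 instance.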
ericauld
(Eric Auld)
2
Some more information:
Turning up the verbosity showed that the connection succeeds, but the process hangs after it is established.
Rank 0: [I socket.cpp:297] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [ip-….us-west-2.compute.internal]:5740
Rank 1: [I TCPStore.cpp:261] [c10d - debug] TCP client connected to host …:23456
I’m using IPv4.
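Since the server log shows a listener on [::] while the setup is IPv4, one thing that can be checked independently of PyTorch is whether a plain IPv4 TCP connection to a given host and port succeeds, similar to the netcat test mentioned earlier. The sketch below uses only the standard library; the function name can_connect_ipv4 is hypothetical. Note that it only tests one port at a time. The log lines above cover the TCPStore rendezvous on 23456, and Gloo may open further connections between the ranks on other ports afterwards, so a success here does not rule out a firewall or security group rule blocking those.

```python
import socket


def can_connect_ipv4(host, port, timeout=3.0):
    """Return True if an IPv4 TCP connection to (host, port) succeeds in time."""
    try:
        # Restrict resolution to IPv4 (AF_INET) stream sockets.
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # host did not resolve to any IPv4 address

    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue  # refused, unreachable, or timed out; try the next address
    return False
```

Run from the rank 1 instance against the rank 0 address while the rank 0 script is waiting in init_process_group, for example can_connect_ipv4(rank0_address, 23456), where rank0_address is a placeholder for the actual host.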
ericauld
(Eric Auld)
3
Posted to Stack Exchange: