Hi, newbie here. I have two EC2 instances that won’t connect to each other using gloo.
Things I’ve tried:
- The script works when I run it in two shells on the same machine (the rank 0 machine)
- I can create a simple connection between the two instances with netcat
- I’ve set GLOO_SOCKET_IFNAME to ‘ens5’ after consulting ifconfig(1)
- I’ve confirmed both the rank 0 and rank 1 machine are stuck on
Here’s the code: distributed_training.py · GitHub
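
For anyone who wants to repeat the interface and netcat checks from Python, here is a minimal sketch using only the standard library, assuming Python 3 on Linux. The helper names `interface_exists` and `can_connect` are made up for illustration. Reading the address from `MASTER_ADDR` and using port 29500 are assumptions, not values taken from my script.

```python
import os
import socket


def interface_exists(name: str) -> bool:
    """Return True if a network interface with this name exists on this host."""
    return name in [ifname for _, ifname in socket.if_nameindex()]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 'ens5' is the interface named in the post; the fallbacks are placeholders.
    ifname = os.environ.get("GLOO_SOCKET_IFNAME", "ens5")
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")  # assumed variable name
    port = 29500  # hypothetical port, replace with the one the script uses

    print(f"interface {ifname} present: {interface_exists(ifname)}")
    print(f"can reach {host}:{port}: {can_connect(host, port)}")
```

Run on the rank 1 machine while the rank 0 process is waiting, this only tells you whether the chosen port is reachable over TCP. It does not exercise gloo itself, so a passing check narrows the problem down rather than ruling out a network cause.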