Gloo won't connect between AWS EC2 instances

Hi, newbie here. I have two EC2 instances that won’t connect to each other using Gloo.

Things I’ve tried:

  • The script works when I run it in two shells on the same machine (the rank 0 machine).
  • I can open a simple connection between the two instances with netcat.
  • I’ve set GLOO_SOCKET_IFNAME to ‘ens5’ after consulting ifconfig(1).
  • I’ve confirmed that both the rank 0 and rank 1 machines are stuck in init_process_group.
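For reference, the setup in the list above boils down to roughly the following. This is a minimal sketch, not the linked script: the helper names gloo_env and init_gloo are made up for illustration, the port 23456 is taken from the logs further down, and the 60-second timeout is an arbitrary value I picked so that a stuck init should raise an error instead of waiting out the default timeout.

```python
import os
from datetime import timedelta


def gloo_env(ifname: str = "ens5") -> dict:
    """Environment variables for the Gloo setup described above (hypothetical helper)."""
    return {
        # Interface name taken from ifconfig, as in the list above.
        "GLOO_SOCKET_IFNAME": ifname,
    }


def init_gloo(rank: int, world_size: int, master_addr: str,
              port: int = 23456, timeout_s: int = 60) -> None:
    """Join the process group over TCP using the gloo backend (hypothetical helper)."""
    # Imported here so gloo_env() can be used without PyTorch installed.
    import torch.distributed as dist

    os.environ.update(gloo_env())
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{master_addr}:{port}",
        rank=rank,
        world_size=world_size,
        # Assumption: a short, finite timeout to surface the hang as an error.
        timeout=timedelta(seconds=timeout_s),
    )
```

Each machine would call init_gloo with its own rank (0 or 1), world_size=2, and the rank 0 machine's address as master_addr.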

Here’s the code: distributed_training.py · GitHub

Some more information:

Turning up the verbosity showed that the connection succeeds, but initialization then hangs once the connection is established.

Rank 0: [I socket.cpp:297] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [ip-….us-west-2.compute.internal]:5740

Rank 1: [I TCPStore.cpp:261] [c10d - debug] TCP client connected to host …:23456

I’m using IPv4.
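Since rank 0 reports its server socket as [::]:23456 while I’m on IPv4, here is a small sketch of an IPv4-only reachability check, similar in spirit to the netcat test. It uses only the standard library; the function name can_connect_ipv4 and the 3-second timeout are my own choices, not anything from PyTorch or Gloo.

```python
import socket


def can_connect_ipv4(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection over IPv4 to host:port succeeds."""
    # AF_INET forces IPv4, so an IPv6-only path cannot make this pass.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.connect((host, port))
            return True
        except OSError:
            # Covers refused connections, timeouts, and name resolution errors.
            return False
```

Run from the rank 1 machine against the rank 0 address and port 23456 while rank 0 is listening. It can also be pointed at other ports, in case the backend needs more than the rendezvous port open between the instances (an assumption on my part, not something the logs show).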

Posted to Stack Exchange: