@alchemi5t If you’re running processes on two machines, they won’t be able to talk if you’re using localhost (127.0.0.1) for the address of rank 0 in the initialization method. It must be an IP that’s reachable from all other ranks. In the example here, rank 1 was trying to connect to rank 0 over 127.0.0.1.
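To make that concrete, here is a minimal sketch of a guard that refuses a loopback address for a multi-machine run before it calls `torch.distributed.init_process_group`. The helper names (`build_init_method`, `init_distributed`), the `multi_node` flag, and the example address and port are illustrative assumptions, not taken from the script in question.

```python
import ipaddress


def build_init_method(host: str, port: int, multi_node: bool) -> str:
    """Return a tcp:// init URL, rejecting loopback hosts for multi-machine runs."""
    try:
        is_loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a literal IP address; treat the conventional name "localhost" as loopback.
        is_loopback = host.lower() == "localhost"
    if multi_node and is_loopback:
        raise ValueError(
            f"{host} is only reachable from the machine itself; "
            "use an address of rank 0 that every other rank can reach"
        )
    return f"tcp://{host}:{port}"


def init_distributed(rank, world_size, host, port, multi_node, backend="nccl"):
    # torch is imported lazily so the address check above can run without it.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        init_method=build_init_method(host, port, multi_node),
        rank=rank,
        world_size=world_size,
    )
```

On a single machine, `build_init_method("127.0.0.1", 23456, multi_node=False)` is accepted, since every rank shares the same loopback interface.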
I was running it on one machine with 4 cards in it (trying to train on only 2 of them). I fixed my problem by installing NVIDIA Apex and using apex.parallel.multiproc.
Not sure why I had to do this, because I’ve seen people use the same script without any hacks like this.
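For reference, a rough sketch of the per-process environment that a single-machine, two-GPU run would typically need when using the `env://` initialization method. The function name `rank_env`, the GPU indices 0 and 1, and port 29500 are assumptions for illustration; this is not what Apex does internally.

```python
def rank_env(rank, world_size=2, gpu_ids=(0, 1), port=29500):
    """Build the environment variables for one rank on a single machine."""
    return {
        # Loopback is fine here because all ranks live on the same machine.
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        # Expose only one of the four cards to this process (assumed mapping).
        "CUDA_VISIBLE_DEVICES": str(gpu_ids[rank]),
    }
```

Each training process would be started with `os.environ` updated from `rank_env(rank)` before it initializes the process group.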