Distributed.init_process_group failure


I am trying to get started with torch.distributed using the following toy example on a multi-GPU cluster:

I run the program with the following command:

python3 main.py --init-method tcp:// --rank 0 --world-size 2

The program gets stuck in the dist.init_process_group call on line 42. I am not sure why, as no message is displayed.


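It’s waiting for both ranks to reach that line before it actually initializes the process group. With --world-size 2, that means a second process started with --rank 1 also has to be running.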
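
To make that concrete, here is a minimal sketch of a launcher that starts one process per rank, so that every rank reaches init_process_group. The flags are the ones from the command above. The helper names rank_commands and launch_all are hypothetical, and the script name main.py is taken from that command.

```python
import subprocess
import sys


def rank_commands(init_method, world_size, script="main.py"):
    """Build one command line per rank.

    Every rank gets the same init method and world size;
    only the --rank value differs.
    """
    return [
        [sys.executable, script,
         "--init-method", init_method,
         "--rank", str(rank),
         "--world-size", str(world_size)]
        for rank in range(world_size)
    ]


def launch_all(init_method, world_size, script="main.py"):
    """Start every rank as a separate process and wait for all of them.

    Returns the list of exit codes, indexed by rank.
    """
    procs = [subprocess.Popen(cmd)
             for cmd in rank_commands(init_method, world_size, script)]
    return [proc.wait() for proc in procs]
```

For a single-machine test, something like launch_all("tcp://127.0.0.1:23456", 2) would start ranks 0 and 1 together. That address and port are placeholders I picked for illustration, not the values from the original command. Across several machines, each machine would start only its own rank(s) instead.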

Also see the docs for the torch.distributed.launch tool.

I have launched all the nodes, but the program still gets stuck in init_process_group.

I have solved it. The problem was communication between the nodes.

How did you solve this?

@alchemi5t If you’re running processes on two machines, they won’t be able to talk if you’re using localhost ( for the address of rank 0 in the initialization method. It must be an IP that’s reachable from all other ranks. In the example here, rank 1 was trying to connect to rank 0 over
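
One way to check this before debugging anything else is to test from the other machine whether rank 0's address and port accept TCP connections at all. The sketch below uses only the Python standard library. The function names can_reach and tcp_init_method are hypothetical helpers, and any address or port you pass in is your own value, not one taken from this thread.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def tcp_init_method(master_addr, master_port):
    """Format the tcp:// init method string that every rank has to share."""
    return f"tcp://{master_addr}:{master_port}"
```

Note that can_reach only returns True while something is listening on that port, for example rank 0 blocked inside init_process_group. If it returns False from the machine running rank 1, the likely culprits are a loopback address, a firewall, or a wrong port, and the hang is a networking problem rather than a PyTorch one.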

Hi @pietern,

I was running it on one machine with 4 cards in it (trying to train on only 2). I fixed my problem by installing and using NVIDIA Apex (apex.parallel.multiproc).

Not sure why I had to do this, because I’ve seen people use the same script without any hacks like this.

Very odd. Especially since Apex also uses torch.distributed under the hood.