Distributed.init_process_group failure

Hello,

I am trying to get started with torch.distributed using the following toy example on a multi-GPU cluster:
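(The original main.py isn't reproduced here, but a minimal sketch of such a script, with argument names matching the command below, might look roughly like this; the backend choice is an assumption.)

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--init-method', type=str, default='tcp://127.0.0.1:23456')
parser.add_argument('--rank', type=int, default=0)
parser.add_argument('--world-size', type=int, default=2)
args = parser.parse_args()

# This call blocks until world_size processes have joined the group.
dist.init_process_group(backend='nccl',  # or 'gloo' on a CPU-only box
                        init_method=args.init_method,
                        rank=args.rank,
                        world_size=args.world_size)
print('Rank {} initialized'.format(dist.get_rank()))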

After running the program with the following command:

python3 main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 2

The program gets stuck in the dist.init_process_group call on line 42. I am not really sure why, as no message is displayed.

Thanks,

It’s waiting for both ranks to reach that line before it actually initializes the process group.
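With world size 2 you need to start a second process for rank 1 before the call can return, for example (assuming both processes run on the same machine):

python3 main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 2
python3 main.py --init-method tcp://127.0.0.1:23456 --rank 1 --world-size 2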


Also see the docs for the torch.distributed.launch tool.
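As a sketch (exact flags depend on your PyTorch version), the launcher starts one process per GPU and sets the rank/world-size environment variables for you, so the script reads them via init_method='env://' instead of explicit --rank/--world-size arguments:

python -m torch.distributed.launch --nproc_per_node=2 main.py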


I have launched all the nodes, but the program still gets stuck in init_process_group.

I have solved it; it was a problem with communication between the nodes.

How did you solve this?

@alchemi5t If you’re running processes on two machines, they won’t be able to talk if you’re using localhost (127.0.0.1) for the address of rank 0 in the initialization method. It must be an IP that’s reachable from all other ranks. In the example here, rank 1 was trying to connect to rank 0 over 127.0.0.1.
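For example, if rank 0 runs on a machine reachable at 192.168.1.1 (a hypothetical address), both machines would point the init method at that address:

# on the rank 0 machine
python3 main.py --init-method tcp://192.168.1.1:23456 --rank 0 --world-size 2
# on the rank 1 machine
python3 main.py --init-method tcp://192.168.1.1:23456 --rank 1 --world-size 2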

Hi @pietern,

I was running it on one machine with 4 cards in it (trying to train on only 2). I fixed my problem by installing NVIDIA Apex and using apex.parallel.multiproc.
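(If I recall, the Apex launcher is invoked along these lines, though the exact usage may differ between Apex versions; it spawns one process per visible GPU and fills in the rank arguments itself:

python -m apex.parallel.multiproc main.py
)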

Not sure why I had to do this, because I’ve seen people use the same script without any hacks like this.

Very odd. Especially since Apex also uses torch.distributed under the hood.