I am trying to get started with torch.distributed using the following toy example on a multi-GPU cluster (the file is truncated here):
from random import randint
from time import sleep

import torch
import torch.distributed as dist


def run(world_size, rank, steps):
    for step in range(1, steps + 1):
        # get random int
        value = randint(0, 10)

        # group all ranks
        ranks = list(range(world_size))
        group = dist.new_group(ranks=ranks)

        # compute reduced sum
        tensor = torch.tensor(value, dtype=torch.int)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
After running the program with the following command:
python3 main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 2
The program gets stuck in the dist.init_process_group call on line 42. I am not sure why, as no message is displayed.
It’s waiting for both ranks to reach that line before it actually initializes the process group. With --world-size 2, you also need to start a second process with --rank 1 and the same --init-method.
Also see the docs for the
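To see this rendezvous behaviour on a single machine, here is a minimal sketch that starts both ranks locally. It assumes Linux (it uses the fork start method) and a PyTorch build with the gloo backend, so it runs on CPU. The helper names find_free_port, worker and launch_local are hypothetical and do not come from the original script.

```python
import multiprocessing as mp
import socket

import torch
import torch.distributed as dist


def find_free_port():
    # Hypothetical helper: ask the OS for a currently unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank, world_size, init_method):
    # This call blocks until all `world_size` ranks have reached it.
    dist.init_process_group(
        backend="gloo",  # assumption: CPU backend, so no GPU is required
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    # Each rank contributes rank + 1; the sum is the same on every rank.
    tensor = torch.tensor([rank + 1], dtype=torch.int)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    assert tensor.item() == world_size * (world_size + 1) // 2
    dist.destroy_process_group()


def launch_local(world_size=2):
    # Loopback is fine here because every rank runs on the same machine.
    init_method = f"tcp://127.0.0.1:{find_free_port()}"
    ctx = mp.get_context("fork")
    procs = [
        ctx.Process(target=worker, args=(rank, world_size, init_method))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # An exit code of 0 means the rank initialized and reduced successfully.
    return [p.exitcode for p in procs]
```

Calling launch_local(2) starts ranks 0 and 1 together. Starting only one of the two workers would reproduce the hang described above, since that rank would wait in init_process_group for its peer.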
I have launched all the nodes, but the program still gets stuck in init_process_group.
I have solved it. The problem was the communication between nodes.
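One way to narrow down this kind of communication problem is to test plain TCP connectivity to the rank 0 address before involving torch.distributed. The sketch below uses only the standard library; can_reach is a hypothetical helper name, and the check only succeeds if something is already listening on that port (for example, rank 0 waiting inside init_process_group).

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Run it on the machine that hosts a non-zero rank, with the host and port taken from the --init-method value. A False result points to a firewall, a wrong address, or rank 0 not having started yet.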
@alchemi5t If you’re running processes on two machines, they won’t be able to talk if you’re using localhost (127.0.0.1) for the address of rank 0 in the initialization method. It must be an IP that’s reachable from all other ranks. In the example here, rank 1 was trying to connect to rank 0 over 127.0.0.1.
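As a small guard against this mistake, a launcher could check whether the init method points at a loopback address before starting ranks on several machines. This is only a sketch using the standard library; is_loopback_init_method is a hypothetical helper, and 10.0.0.5 in the usage note is a made-up example address.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_init_method(init_method):
    """Return True if a tcp:// init method points at localhost or a loopback IP."""
    host = urlparse(init_method).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A hostname other than "localhost"; it is not resolved in this sketch.
        return False
```

With the command from the question, is_loopback_init_method("tcp://127.0.0.1:23456") flags the address, whereas an address such as tcp://10.0.0.5:23456 would pass the check. A loopback address is still fine when all ranks run on the same machine.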
I was running it on one machine with 4 cards in it (trying to train on only 2). I fixed my problem by installing and using NVIDIA Apex (apex.parallel.multiproc).
Not sure why I had to do this, because I’ve seen people use the same script without any hacks like this.
Very odd. Especially since Apex also uses torch.distributed under the hood.