Torch DDP hangs only with the Gloo backend (not NCCL)

I’m trying out a minimal example that enables point-to-point communication between two machines using PyTorch DDP. But the process hangs at init_process_group when I use the gloo backend. The code works when I use the nccl backend.

Server

import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")
    time.sleep(1000)

def init_process(rank, size, fn, backend="gloo"): 
    os.environ['MASTER_ADDR'] = '192.168.0.3' 
    os.environ['MASTER_PORT'] = '29522'
    os.environ['WORLD_SIZE'] = str(size) 
    os.environ['RANK'] = str(rank)
    print(f"Pre Complete initialization {rank} {size}")
    from datetime import timedelta
    server_store = dist.TCPStore("192.168.0.3", 12345, 2, True, timedelta(seconds=30))
    server_store.set("first_key", "first_value")
    dist.init_process_group(backend, store=server_store, rank=rank, world_size=size)
    
    print(f"Complete initialization")
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    p = mp.Process(target=init_process, args=(0, size, run))
    p.start()
    processes.append(p)

    for p in processes:
        p.join()

Client

import os

import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")

def init_process(rank, size, fn, backend="gloo"): 
    os.environ['MASTER_ADDR'] = '192.168.0.3' 
    os.environ['MASTER_PORT'] = '29522'
    os.environ['WORLD_SIZE'] = str(size) 
    os.environ['RANK'] = str(rank)
    print(f"Pre Complete initialization {rank} {size}")
    client_store = dist.TCPStore("192.168.0.3", 12345, 2, False)
    print(client_store.get("first_key"))
    dist.init_process_group(backend, rank=rank, store=client_store, world_size=size)
    
    print(f"Complete initialization")
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    p = mp.Process(target=init_process, args=(1, size, run))
    p.start()
    processes.append(p)

    for p in processes:
        p.join()

Is there a way to find out why? Thanks!

I am not sure if this is the right way to do it or not.

Have you tried setting the environment variable GLOO_DESYNC_DEBUG=1?

Also cc: @kumpera
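For anyone trying this, here is a minimal sketch of setting debug variables before init_process_group runs. GLOO_DESYNC_DEBUG is the name from the reply above, and whether PyTorch actually reads it is not confirmed here. TORCH_DISTRIBUTED_DEBUG and TORCH_CPP_LOG_LEVEL are documented PyTorch variables. The helper name enable_dist_debug is hypothetical.

```python
import os


def enable_dist_debug(env=None):
    """Set debug-related variables; call this before init_process_group."""
    env = os.environ if env is None else env
    # Values must be strings: os.environ["X"] = 1 raises TypeError.
    env["GLOO_DESYNC_DEBUG"] = "1"  # name from the reply above, unverified
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # documented values: OFF, INFO, DETAIL
    env["TORCH_CPP_LOG_LEVEL"] = "INFO"  # surfaces C++-side log messages
    return env
```

Calling enable_dist_debug() as the first line of init_process on both machines should be enough, since the variables only need to be set in the process that initializes the group.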

Thanks for the reply. Unfortunately running with “GLOO_DESYNC_DEBUG” enabled does not print anything.

I also tried connecting without specifying the store option in init_process_group, and the connections on port 29522 are established, as shown on the server side:

tcp6       0      0 mew0:60752              mew0:29522              ESTABLISHED 618009/python       
tcp6       0      0 mew0:29522              mew3:33722              ESTABLISHED 618009/python       
tcp6       0      0 mew0:60750              mew0:29522              ESTABLISHED 618009/python       
tcp6       0      0 mew0:29522              mew3:33728              ESTABLISHED 618009/python       
tcp6       0      0 mew0:29522              mew0:60750              ESTABLISHED 618009/python       
tcp6       0      0 mew0:29522              mew0:60752              ESTABLISHED 618009/python   

from running netstat. But init_process_group still hangs. NCCL works either way, so the code only blocks when using gloo. Any suggestions on what might be happening? Thanks!

I don’t know about this then… cc: @kumpera do you happen to know what is happening?
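One thing that may be worth ruling out, though it is not confirmed as the cause here: unless GLOO_SOCKET_IFNAME is set, Gloo picks the address it advertises to peers by resolving the machine's own hostname. If that hostname resolves to a loopback address (for example through an /etc/hosts entry), the other machine cannot reach it, and the rendezvous can hang even though the store connection is established. A rough stdlib-only check, with hypothetical helper names:

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted set of IP addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # info[4][0] is the address string; drop any IPv6 scope suffix such as "%eth0".
    return sorted({info[4][0].split("%")[0] for info in infos})


def resolves_to_loopback(hostname):
    """Return True if any resolved address is loopback (127.0.0.0/8 or ::1)."""
    return any(
        ipaddress.ip_address(addr).is_loopback
        for addr in resolved_addresses(hostname)
    )
```

Running resolves_to_loopback(socket.gethostname()) on each machine shows whether this applies. If it returns True, setting GLOO_SOCKET_IFNAME to the interface that carries the 192.168.0.x network, on both machines and before init_process_group, would be the next thing to try.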