RPC + Torchrun hangs in ProcessGroupGloo

Hello, I’m trying to run RPC with multiple nodes to implement model parallelism. When using only 1 node (local execution), everything works fine. However, when trying to use other nodes with torchrun, the script hangs in rpc.init_rpc.

I have checked other threads and solutions, but they usually cover a single node with multiple GPUs (like [Distributed: RPC] Failed to initialize RPC with >18 workers · Issue #85607 · pytorch/pytorch · GitHub, which is not my case: I have only 1 GPU per node), or they use multiple nodes with multiple GPUs but the question and solution don’t include any minimal code (Torchrun launched jobs hang on multiple machines), so I cannot really compare.

Applying this “hack” (Remove the dependency of Gloo from RPC) I manage to get it to execute, but with some errors at the end (most likely because of a race condition when finalizing the workers). Therefore, I’m pretty sure the problem is related to ProcessGroupGloo.

I’m using my university cluster, so I don’t have admin privileges and most ports are closed for security reasons.

My minimal code is the same as in the GitHub issue:

# rpc_test.py
# https://github.com/pytorch/pytorch/issues/85607

import os
import random

import numpy as np
import torch

import torch.distributed.rpc as rpc


def worker_init():
    rank = int(os.environ['RANK'])

    random.seed(rank)
    np.random.seed(rank)
    torch.manual_seed(rank)
    print(f'Rank {rank}')


def main():
    rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
        init_method='tcp://compute-2-2:53555'
    )

    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    
    rpc.init_rpc(
        name=f'worker{rank}',
        rank=rank,
        world_size=world_size,
        rpc_backend_options=rpc_backend_options,
    )
    print("pos init")
    worker_init()
    
    # no-op
    rpc.shutdown()


if __name__ == '__main__':
    main()

If I do not use rpc_backend_options, Gloo tries to use a forbidden port.

I searched and, apparently, I should be able to tell Gloo which port to use via the rpc_backend_options parameter. However, when using it (as in the code above) with a valid port, the process hangs indefinitely.

I run the script on both nodes with:
Node 0: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py
Node 1: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py

I have also tried without the rdzv parameters, using just master_addr and master_port, but nothing seems to work.
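(For context, if I understand the docs correctly: when rpc.init_rpc is called without an explicit init_method, it defaults to the env:// rendezvous and reads MASTER_ADDR and MASTER_PORT, which torchrun exports to every worker. So the hard-coded tcp:// address should be roughly equivalent to this sketch:)

# sketch only: build the init_method from the variables torchrun sets,
# instead of hard-coding compute-2-2:53555
import os
import torch.distributed.rpc as rpc

init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
rpc_backend_options = rpc.TensorPipeRpcBackendOptions(init_method=init_method)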

I can confirm that torchrun works, because I have successfully executed other scripts with torchrun + DDP between these machines, but DDP doesn’t solve my memory problem, so I want to try model parallelism.

It seems that the solution proposed here (Getting Gloo error when connecting server and client over VPN from different systems - #2 by selineni) fixes the hanging…
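In other words, exporting GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME so that both ProcessGroupGloo and TensorPipe bind to the network interface that actually connects the nodes. The same thing can be done from inside the script before calling rpc.init_rpc; just a sketch, with 'eth0' standing in for whatever interface ip addr reports on the cluster nodes:

import os

# must run before rpc.init_rpc(); replace 'eth0' with the interface
# that connects the nodes (check with `ip addr` on each node)
os.environ.setdefault('GLOO_SOCKET_IFNAME', 'eth0')  # used by ProcessGroupGloo
os.environ.setdefault('TP_SOCKET_IFNAME', 'eth0')    # used by TensorPipe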

Nothing like asking after fighting with the problem for a few hours, only to find the solution shortly after.

EDIT —

For completeness:

# rpc_test.py
# https://github.com/pytorch/pytorch/issues/85607

import os
import random

import numpy as np
import torch

import torch.distributed.rpc as rpc


def worker_init():
    rank = int(os.environ['RANK'])

    random.seed(rank)
    np.random.seed(rank)
    torch.manual_seed(rank)
    print(f'Rank {rank}')


def main():
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    
    print("initing rpc")
    rpc.init_rpc(name=f'worker{rank}', rank=rank, world_size=world_size)
    print("rpc inited - worker init")
    worker_init()
    print("worker inited")
    
    # no-op
    rpc.shutdown()


if __name__ == '__main__':
    main()

And the execution calls:

TP_SOCKET_IFNAME=<interface> GLOO_SOCKET_IFNAME=<interface> torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py

TP_SOCKET_IFNAME=<interface> GLOO_SOCKET_IFNAME=<interface> torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py
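
Once init_rpc no longer hangs, a quick sanity check that the nodes can really talk to each other is to issue a remote call before rpc.shutdown(). A minimal sketch (the torch.add call is just an arbitrary example, not part of the script above):

# e.g. inside main(), right after rpc.init_rpc():
if rank == 0 and world_size > 1:
    # run torch.add on worker1 and bring the result back to rank 0
    ret = rpc.rpc_sync('worker1', torch.add, args=(torch.ones(2), torch.ones(2)))
    print(f'remote add on worker1 returned {ret}')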