Hello, I’m trying to run RPC with multiple nodes to implement model parallelism. When using only 1 node (local execution) everything works fine. However, when I try to use other nodes with torchrun, the script hangs in rpc.init_rpc.
I have checked other threads and solutions, but they usually deal with a single node with multiple GPUs (like this [Distributed: RPC] Failed to initialize RPC with >18 workers · Issue #85607 · pytorch/pytorch · GitHub, which is not my case, since I only have 1 GPU per node), or they use multiple nodes with multiple GPUs but neither the question nor the solution includes any minimal code (Torchrun launched jobs hang on multiple machines), so I cannot really compare.
Applying this “hack” (Remove the dependency of Gloo from RPC) I manage to get it to execute, but with some errors at the end (most likely due to a race condition when the workers are finalized), so I’m fairly sure the problem is related to ProcessGroupGloo.
I’m using my university cluster, so I don’t have admin privileges and most ports are closed for security reasons.
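As a sanity check for the port restrictions, I verify from one node that the port I plan to use on the other node is actually reachable while the other side is listening. This is just a quick stdlib sketch; the host and port are the values from my setup:

import socket

def port_is_reachable(host, port, timeout=5.0):
    # This only succeeds if something is already listening on host:port,
    # so run it while the rendezvous/init process is up on the other node.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_is_reachable('compute-2-2', 53555))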
My minimal code is the same as in the GitHub issue:
# rpc_test.py
# https://github.com/pytorch/pytorch/issues/85607
import os
import random
import numpy as np
import torch
import torch.distributed.rpc as rpc
def worker_init():
    rank = int(os.environ['RANK'])
    random.seed(rank)
    np.random.seed(rank)
    torch.manual_seed(rank)
    print(f'Rank {rank}')


def main():
    rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
        init_method='tcp://compute-2-2:53555'
    )
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    rpc.init_rpc(
        name=f'worker{rank}',
        rank=rank,
        world_size=world_size,
        rpc_backend_options=rpc_backend_options,
    )
    print("pos init")
    worker_init()
    # no-op
    rpc.shutdown()


if __name__ == '__main__':
    main()
If I do not pass rpc_backend_options, Gloo tries to use a forbidden port.
From what I found, it seems I can tell Gloo which port to use through the rpc_backend_options parameter. However, when I use it (as in the code above) with a valid, open port, the process hangs indefinitely.
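In case it is also an interface-selection problem, the docs mention that Gloo and TensorPipe can be pinned to a specific network interface through environment variables set before calling rpc.init_rpc. A sketch of what I understand that would look like (eth0 is just a placeholder; the real interface name may differ):

import os

# Placeholder interface name; check with `ip addr` which interface connects the nodes.
os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'  # interface used by the Gloo process group
os.environ['TP_SOCKET_IFNAME'] = 'eth0'    # interface used by TensorPipe's TCP transport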
I run the script on both nodes with:
Node 0: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py
Node 1: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --rdzv_id=0 --rdzv_endpoint=compute-2-2:53554 minimal-rpc.py
I have also tried without the rdzv parameters, just using --master_addr and --master_port, but nothing seems to work.
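For reference, that variant looks roughly like this (node 0 shown; node 1 is the same with --node_rank=1):

torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=compute-2-2 --master_port=53554 minimal-rpc.py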
I can confirm that torchrun itself works, because I have successfully run other scripts with torchrun + DDP between these machines, but DDP doesn’t solve my memory problem, so I want to try model parallelism.
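For context, once init_rpc works, the kind of model parallelism I’m after is roughly the following (the module split, sizes, and worker names are just illustrative placeholders):

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

class Stage(nn.Module):
    # One slice of the model; each node holds one of these.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.layer = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.layer(x)

def create_stage(in_features, out_features):
    # Executed on the remote worker; the returned module stays there.
    return Stage(in_features, out_features)

# On worker0, after rpc.init_rpc(...) has succeeded:
# local_stage = Stage(128, 64)
# remote_stage = rpc.remote('worker1', create_stage, args=(64, 10))  # RRef to the remote module
# x = torch.randn(32, 128)
# h = local_stage(x)                          # forward pass of the local half
# y = remote_stage.rpc_sync().forward(h)      # forward pass of the remote half via the RRef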