Hi,
I found a discussion on dynamic world size in torch RPC and have some questions related to it. We are developing a distributed graph learning library for PyTorch, graphlearn-for-pytorch. Our current goal is to refactor the library's RPC functionality and make it dynamic: sampler processes sample mini-batches from a large graph and run as a long-lived background service, while trainer processes can join dynamically and request mini-batches from the sampler processes. This would allow us to run different experiments without reloading the graph.
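To make the intended pattern concrete, here is a minimal sketch of what we are aiming for. It assumes that omitting `world_size` in `rpc.init_rpc` enables dynamic group membership and that a worker can leave via `rpc.shutdown()`; `sample_minibatch` and the worker names are hypothetical placeholders, not part of graphlearn-for-pytorch.

```python
import torch.distributed.rpc as rpc


def sample_minibatch(batch_size):
    # Hypothetical placeholder: the real sampler would return a mini-batch
    # of nodes/edges sampled from the loaded graph.
    return list(range(batch_size))


def run_sampler(rank):
    # Long-lived sampler: joins the RPC group once (no world_size -> dynamic).
    rpc.init_rpc(name=f'sampler_{rank}', rank=rank)
    # Graceful shutdown blocks while continuing to serve incoming RPCs;
    # whether this holds under dynamic membership is part of our question.
    rpc.shutdown()


def run_trainer(rank, sampler_rank=0):
    # Trainer joins dynamically, pulls a few mini-batches, then leaves,
    # without the sampler ever reloading the graph.
    rpc.init_rpc(name=f'trainer_{rank}', rank=rank)
    for _ in range(3):
        batch = rpc.rpc_sync(f'sampler_{sampler_rank}', sample_minibatch, args=(32,))
        # ... feed `batch` to the model ...
    rpc.shutdown()
```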
I have the following questions:
- Is dynamic world size a stable and long-term supported feature? While I can identify relevant code changes, I couldn’t find any documentation or tutorials.
- How are some RPC APIs expected to behave in this dynamic scenario? For example, I tested `rpc.api._all_gather` with the script below and found that only one rank receives the response:
```python
import os
import time
import multiprocessing as mp

import torch
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '11111'


def worker(rank):
    # No world_size is passed, so the worker joins with dynamic group membership.
    rpc.init_rpc(
        name=f'worker_{rank}',
        rank=rank,
    )
    print(f'rank {rank} initialized')
    time.sleep(1)
    # Expected: every rank prints the gathered device counts.
    # Observed: only one rank receives the response.
    print(rpc.api._all_gather(torch.cuda.device_count()))
    if rank == 0:
        # Give the other ranks a moment to finish before rank 0 shuts down.
        time.sleep(1)
    rpc.shutdown()
    print(f'rank {rank} exited')


if __name__ == '__main__':
    ranks = [0, 1, 2, 3]
    processes = []
    for rank in ranks:
        process = mp.Process(target=worker, args=(rank,))
        process.start()
        processes.append(process)
    for process in processes:
        process.join()
```
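In case `_all_gather` is simply not meant to be used this way (it is a private API), a workaround we are considering is to gather the values explicitly with the public `rpc.rpc_sync`, assuming the peer ranks are known in advance. This is only a sketch, reusing the `worker_{rank}` names from the script above:

```python
import torch
import torch.distributed.rpc as rpc


def gather_device_counts(peer_ranks):
    # Ask each known peer for its CUDA device count via the public rpc_sync API,
    # instead of relying on the private rpc.api._all_gather.
    return {
        f'worker_{r}': rpc.rpc_sync(f'worker_{r}', torch.cuda.device_count)
        for r in peer_ranks
    }
```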