Questions on dynamic world size

Hi,
I found a discussion on the dynamic world size feature of torch RPC and have some questions related to it. We are developing a distributed graph learning library for PyTorch, graphlearn-for-pytorch. Our current goal is to refactor the library’s RPC functionality to support a dynamic world size: sampler processes sample mini-batches from a large graph and run as long-lived background services, while trainer processes join dynamically and request mini-batches from the samplers. This would allow us to run different experiments without reloading the graph.
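
To make the goal concrete, here is a minimal sketch of the setup we have in mind; the function and worker names (sample_batch, stop_sampler, sampler_0, trainer_1) are placeholders for illustration rather than graphlearn-for-pytorch APIs, and world_size is omitted from init_rpc to request dynamic membership:

import os
import threading
import time
import multiprocessing as mp
import torch
import torch.distributed.rpc as rpc

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '11111')

_stop = threading.Event()

def sample_batch(batch_size):
    # Placeholder for the real graph sampler: return a dummy mini-batch.
    return torch.randn(batch_size, 16)

def stop_sampler():
    # Called remotely by a trainer so the sampler can exit.
    _stop.set()

def run_sampler(rank):
    # Samplers join first and stay up as long-lived background services.
    # world_size is omitted, so membership is dynamic.
    rpc.init_rpc(name=f'sampler_{rank}', rank=rank)
    _stop.wait()      # keep serving sampling requests until a trainer says stop
    time.sleep(1)     # give the last response time to flush before shutting down
    rpc.shutdown()

def run_trainer(rank):
    # Trainers join later, under a new rank, and pull mini-batches on demand.
    rpc.init_rpc(name=f'trainer_{rank}', rank=rank)
    time.sleep(1)     # crude wait until the sampler has joined
    batch = rpc.rpc_sync('sampler_0', sample_batch, args=(32,))
    print(f'trainer_{rank} got a batch of shape {tuple(batch.shape)}')
    rpc.rpc_sync('sampler_0', stop_sampler)
    rpc.shutdown()

if __name__ == '__main__':
    sampler = mp.Process(target=run_sampler, args=(0,))
    trainer = mp.Process(target=run_trainer, args=(1,))
    sampler.start()
    trainer.start()
    sampler.join()
    trainer.join()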

I have the following questions:

  1. Is dynamic world size a stable and long-term supported feature? While I can identify relevant code changes, I couldn’t find any documentation or tutorials.
  2. How are the RPC APIs expected to behave in this dynamic scenario? For example, I tested rpc.api._all_gather with the script below and found that only one rank receives the gathered result (a possible workaround sketch follows after the script).
import torch.distributed.rpc as rpc
import time
import multiprocessing as mp
import torch
import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '11111'

def worker(rank):
    # world_size is intentionally omitted so the worker joins with dynamic membership.
    rpc.init_rpc(
        name=f'worker_{rank}',
        rank=rank,
    )
    print(f'rank {rank} initialized')
    time.sleep(1)
    # Every rank calls the private _all_gather, but only one rank prints the full result.
    print(rpc.api._all_gather(torch.cuda.device_count()))
    if rank == 0:
        time.sleep(1)
    rpc.shutdown()
    print(f'rank {rank} exited')

if __name__ == '__main__':
    ranks = [0, 1, 2, 3]
    processes = []
    for rank in ranks:
        process = mp.Process(target=worker, args=(rank,))
        process.start()
        processes.append(process)
    for process in processes:
        process.join()
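
As a possible workaround while the intended behavior of _all_gather under dynamic membership is unclear, I am considering gathering values explicitly through the public API instead. The sketch below assumes the caller already knows the current worker names (the worker_{rank} pattern from the script above); unlike a real all-gather, only the calling rank ends up with the results.

import torch
import torch.distributed.rpc as rpc

def local_device_count():
    # Value each worker contributes to the "gather".
    return torch.cuda.device_count()

def gather_device_counts(peer_names):
    # Ask every peer for its value via the public rpc_sync API;
    # only the calling worker collects the full mapping.
    return {name: rpc.rpc_sync(name, local_device_count) for name in peer_names}

# Example usage from inside worker(), after init_rpc:
# counts = gather_device_counts([f'worker_{r}' for r in range(4)])
# print(counts)

Is something like this the recommended pattern for dynamic groups, or is there a public collective-style API we should use instead?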