RPC - dynamic world size

Is it possible to have a dynamic world size when using torch.distributed.rpc?
I want a changing number of processes communicating via the TensorPipe backend, without explicitly stating a world size, with each process dynamically assigned a rank.

Hey @ItamarWilf,

Unfortunately, this is not yet possible with the RPC package, but it is on our roadmap.

Hi Shen. Is this feature available now? If not, will it be available in the near future?

Hi @sagebei, we have a prototype of this feature in the PyTorch 1.12 release, which will be available on June 28 (view releases here: Releases · pytorch/pytorch · GitHub). Or feel free to pull the nightly PyTorch build.

As part of this feature, an RPC process calls init_rpc the same way as before; however, when the world_size argument is not specified, the group is assumed to be dynamic, which allows processes to join and leave it. shutdown() is used to leave the group. We do not currently support dynamic rank allocation. The documentation will be updated for this as well.

Current

# blocking join for all processes
init_rpc("workerN", world_size=N, rank=N)

# blocking shutdown for all processes
shutdown(graceful=True)
# nonblocking shutdown for a single process
shutdown(graceful=False)

New

# node join
init_rpc("worker0", rank=0)

# node leave
rpc.shutdown()
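
For readers who want to try this, here is a minimal end-to-end sketch of how a dynamic group could be exercised with the 1.12 prototype API. The worker names, the add function, the sleep-based coordination, and the environment-variable rendezvous setup are illustrative assumptions on my part, not an official example.

import os
import time

import torch.distributed.rpc as rpc


def add(a, b):
    return a + b


def run(name, rank):
    # Rendezvous settings are illustrative; any init_method reachable by all
    # processes would work here.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    # No world_size argument: the group is treated as dynamic, so processes
    # may join at any time. Ranks are still assigned manually, since dynamic
    # rank allocation is not yet supported.
    rpc.init_rpc(name, rank=rank)

    if rank == 0:
        # Crude placeholder: keep the callee alive long enough for peers to
        # join and finish their RPCs before this process leaves the group.
        time.sleep(10)
    else:
        # A newly joined process can issue RPCs to workers already in the group.
        result = rpc.rpc_sync("worker0", add, args=(1, 2))
        print(f"{name} got {result}")

    # Leave the group.
    rpc.shutdown()

For example, you could start run("worker0", 0) first and later launch run("worker1", 1) in a separate process; the second process joins the existing group without either process ever specifying a world size.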

Curious, can you provide a few details on how you will be using this feature and the scenarios / architectures / models you are training? Thanks!