How does the distributed.rpc package manage nodes?

I have several questions about the distributed.rpc package; I hope someone can shed some light on them, thanks:

  1. It looks like during RPC init (init_rpc), all nodes need to call init_rpc before initialization finishes; otherwise all nodes block waiting. Is there any reason for that behavior? In some scenarios, when the main node is initialized and some (but not all) workers are initialized, the main node should be able to send work to those workers without waiting for all workers to be initialized.
  2. Does rpc support adding/removing nodes after all nodes have been initialized?
  3. What will happen if some node gets disconnected after all nodes have been initialized? Will the rest of the nodes block waiting, or will the other nodes work as normal?
  4. Does rpc have any failure recovery built in, or is there a plan to introduce it in the future?

Hey @frankdong, thanks for the questions. Could you please share some more info about your use case? We would love to learn your requirements and use them as input for the elastic RPC design.

It looks like during RPC init (init_rpc), all nodes need to call init_rpc before initialization finishes; otherwise all nodes block waiting. Is there any reason for that behavior? In some scenarios, when the main node is initialized and some (but not all) workers are initialized, the main node should be able to send work to those workers without waiting for all workers to be initialized.

This is a limitation of the current RPC implementation. The main reason is that early versions of RPC used ProcessGroupGloo for communication, which requires a rendezvous to initialize. We are working on adding elastic support (nodes can join/leave dynamically), and hopefully that can address this problem. The TensorPipe RPC backend is one step towards that, but the current version still has some dependency on ProcessGroupGloo.
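
Here is a minimal sketch of what that rendezvous looks like from the API side; the names ("master"/"workerN"), the address/port, and the process layout are illustrative. The key point is that init_rpc does not return on any rank until every rank in world_size has called it:

```python
import os
import torch.distributed.rpc as rpc

def start(rank: int, world_size: int):
    # Illustrative rendezvous settings; in practice these come from your launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    name = "master" if rank == 0 else f"worker{rank}"
    # Blocks until all world_size ranks have reached this call (the rendezvous
    # mentioned above), so no work can be dispatched before everyone has joined.
    rpc.init_rpc(name, rank=rank, world_size=world_size)
    # ... send or serve RPCs here ...
    rpc.shutdown()  # also synchronizes across ranks before tearing down
```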

cc @agolynski for elasticity
cc @lcw for TensorPipe

Does rpc support adding/removing nodes after all nodes have been initialized?

Not yet, but this is on our roadmap.

What will happen if some node gets disconnected after all nodes have been initialized? Will the rest of the nodes block waiting, or will the other nodes work as normal?
Does rpc have any failure recovery built in, or is there a plan to introduce it in the future?

If the network failure is transient, it should be fine; internal control messages are retried automatically. If a message contains a user function and is lost during the network failure, the sender will see a timeout error.
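
As a concrete illustration of that sender-side behavior, here is a minimal sketch (the worker name, the toy function, and the 10-second timeout are placeholders, and it assumes init_rpc has already completed on both nodes):

```python
import torch.distributed.rpc as rpc

def run_job(x):
    return x * 2  # placeholder user function

# Non-blocking call to a remote worker with an explicit per-call timeout.
fut = rpc.rpc_async("worker1", run_job, args=(3,), timeout=10)
try:
    result = fut.wait()
except RuntimeError as err:
    # If the request or the response is lost during a network failure and the
    # deadline expires, wait() surfaces a timeout error the caller can handle,
    # e.g. by resubmitting the job to another worker.
    print(f"RPC to worker1 failed or timed out: {err}")
```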

If it’s a permanent network failure or a node failure, the current version cannot handle those. We do have plans to cover this gap. There is some relevant discussion here: [RFC] torch.distributed.app (role-based + higher-level RPC APIs) · Issue #41425 · pytorch/pytorch · GitHub

@mrshenli Thanks for your detailed answer.
Do we have a rough timeline for elastic RPC?
Regarding our scenario: we are using the RPC package as our distributed infrastructure. We have one scheduler node that is stateful; it splits jobs and sends them out to workers. Workers are stateless: each receives a job from the scheduler and reports the result back to the scheduler as soon as it finishes execution. The scheduler then decides whether there are more jobs to send out, checks whether any workers are idle, and coordinates the whole execution process. So we need our distributed infrastructure to be flexible enough to manage all distributed workers and handle node failures properly.
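
A rough sketch of that kind of scheduler/worker loop on top of the current static-world_size RPC API might look like the following; the worker names and the toy run_job body are illustrative, not an existing API:

```python
import torch.distributed.rpc as rpc

def run_job(job):
    # Stateless worker: execute the job and return the result.
    return job * 2  # placeholder for real work

def scheduler_loop(worker_names, jobs):
    # Stateful scheduler: send one job per idle worker without blocking,
    # then collect results and decide whether to dispatch more work.
    pending = {name: rpc.rpc_async(name, run_job, args=(job,))
               for name, job in zip(worker_names, jobs)}
    for name, fut in pending.items():
        result = fut.wait()  # the worker "reports back" via the returned future
        print(f"{name} finished with result {result}")
```

With the current API this only works while every node named in worker_names stays up and initialized, which is exactly where elastic RPC and failure recovery would help.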