Hey @frankdong, thanks for the questions. Could you please share some more info about your use case? We would love to learn about your requirements and use them as input for the elastic RPC design.
It looks like during RPC initialization (init_rpc), all nodes need to call init_rpc before initialization finishes; otherwise all nodes block waiting. Is there any reason for that behavior? In some scenarios, once the main node and some (but not all) workers are initialized, the main node should be able to send work to those workers without waiting for all workers to be initialized.
This is a limitation of the current RPC implementation. The main reason is that early versions of RPC used ProcessGroupGloo for communication, which requires a rendezvous to initialize. We are working on adding elastic support (nodes can join/leave dynamically), which hopefully can address this problem. The TensorPipe RPC backend is one step toward that, but the current version still has some dependency on process-group-based initialization.
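To illustrate the rendezvous behavior described above, here is a minimal sketch of how each process joins the group today. The `worker{rank}` naming and the environment-variable setup are illustrative conventions, not requirements; the key point is that `init_rpc` takes a fixed `world_size` and does not return on any node until all `world_size` processes have called it.

```python
# Sketch: rendezvous-style initialization in torch.distributed.rpc.
# Every one of the `world_size` processes must reach init_rpc before
# ANY of them returns; if one worker is missing, the others block here.
import os
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

def start_worker(rank: int, world_size: int) -> None:
    rpc.init_rpc(
        name=f"worker{rank}",   # illustrative naming convention
        rank=rank,
        world_size=world_size,  # fixed membership: all ranks must join
    )
    # ... issue rpc_sync / rpc_async calls here ...
    rpc.shutdown()  # by default this is also a barrier across all workers
```

Because membership is fixed at `world_size`, there is currently no way for the main node to start dispatching work while some workers are still absent.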
cc @agolynski for elasticity
cc @lcw for TensorPipe
Does RPC support adding/removing nodes after all nodes are initialized?
Not yet, but this is on our roadmap.
What will happen if a node gets disconnected after all nodes are initialized? Will the rest of the nodes block waiting, or will they continue to work as normal?
Does RPC have any failure recovery built in, or plans to introduce it in the future?
If the network failure is transient, it should be fine: internal control messages are retried automatically. If a message containing a user function is lost during the network failure, the sender will see a timeout error.
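As a hedged sketch of how that timeout surfaces to the caller: `rpc_sync` accepts a per-call `timeout` (in seconds), and a lost user-function message raises on the sender side rather than blocking forever. The peer name `"worker1"` and the `add` function are assumptions for illustration, and the snippet assumes `init_rpc` has already been called.

```python
# Hypothetical sketch: surfacing the timeout described above.
# Assumes the RPC group is already initialized and includes "worker1".
import torch.distributed.rpc as rpc

def add(a: int, b: int) -> int:
    return a + b

def call_with_timeout() -> None:
    try:
        # If the message carrying `add` is lost, this raises after
        # `timeout` seconds instead of waiting indefinitely.
        result = rpc.rpc_sync("worker1", add, args=(2, 3), timeout=10.0)
        print(f"got {result}")
    except RuntimeError as exc:
        # Control messages are retried internally; a lost user-function
        # message shows up here as a timeout error on the sender.
        print(f"RPC failed: {exc}")
```

The design choice to retry only control messages means user code should be prepared to catch and handle this error itself, e.g. by re-issuing the call.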
If it’s a permanent network failure or a node failure, the current version cannot handle those. We do have plans to cover this gap. There is some relevant discussion here: https://github.com/pytorch/pytorch/issues/41425