Does the TensorPipe RPC backend work in this case? It is still experimental, but we plan to make it the default backend and focus on it in future releases.
Regarding the error, I suspect this is a limitation in Gloo, but I don’t have enough resources to verify it locally. If possible, could you check whether init_process_group and all_reduce work with 1024 nodes?
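If it helps, here is a minimal sketch of that check. RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT would normally be set by your launcher; the defaults below only exist so a single-process run works:

```python
# Minimal smoke test: initialize the Gloo backend and run one all_reduce.
# Launch one copy of this script per process.
import os
import torch
import torch.distributed as dist

def main():
    # Defaults here are only for a single-process smoke run; a real launcher
    # (e.g. torchrun or a custom scheduler) sets these environment variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(
        backend="gloo",
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
    # Every rank contributes a tensor of ones; after the all_reduce each
    # rank should hold a tensor filled with world_size.
    t = torch.ones(4)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.eq(dist.get_world_size()).all()
    dist.destroy_process_group()
    return t

if __name__ == "__main__":
    main()
```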
It seems that the process group needs too many ports. It establishes a connection between every pair of processes, so each server needs 32 * 1024 ports for TCP connections (32 local processes, each connecting to the 1024 processes across 32 servers * 32 processes). Do you have any plans to optimize this?
That’s true for now, but Gloo is only used for initialization and shutdown. We will remove the dependency on Gloo in a future release (hopefully v1.8).
Good catch! I am not aware of any plan to improve Gloo for this. cc @pbelevich
For the RPC framework, this seems to happen because Gloo creates a TCP connection for every pair of processes in the group.
I’m wondering if this can be avoided in TensorPipe, where TCP connections are created on demand and kept in a pool for reuse. Typically in an RPC environment, we’re not talking to all the nodes in the group at the same time.
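For what it’s worth, the pooling idea can be sketched like this (a hypothetical illustration, not TensorPipe’s actual code or API):

```python
# Hypothetical sketch of on-demand connection pooling: a connection to a peer
# is opened only the first time we talk to it, then cached for reuse. With a
# sparse communication pattern, the number of open connections is bounded by
# the peers we actually contact, not by the world size.
class ConnectionPool:
    def __init__(self, dial):
        self._dial = dial   # callable that opens a new connection to a peer
        self._pool = {}     # peer -> cached open connection

    def get(self, peer):
        # Open lazily on first use; a real pool would also evict idle entries.
        if peer not in self._pool:
            self._pool[peer] = self._dial(peer)
        return self._pool[peer]

    def open_count(self):
        return len(self._pool)
```

With 1024 peers but a client/server communication pattern, `open_count()` stays small, unlike the all-pairs connections Gloo sets up eagerly.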
@yueyilia Could you add some details about your use case for RPC here? Are all the nodes (1024) communicating with all the other nodes in your application at the same time? Is it possible to run 1 process per server in your application to get around this in the short term? If the GIL is currently the bottleneck, there is some TorchScript support in the RPC framework that might help you get around the GIL.
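As a sketch of the TorchScript route (the worker name and the matmul workload below are placeholders): a `@torch.jit.script` function invoked over RPC runs in the TorchScript interpreter, outside the Python GIL.

```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def heavy_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Executes in the TorchScript interpreter, so it does not hold the GIL.
    return torch.matmul(a, b)

# On the caller, after rpc.init_rpc(...) has run on every worker:
# fut = rpc.rpc_async("worker1", heavy_matmul,
#                     args=(torch.ones(2, 2), torch.ones(2, 2)))
# result = fut.wait()
```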
TensorPipe would fare better if your topology (i.e., the links you actually use) is a significantly smaller subset of the complete graph, for example a server-client pattern. In other words, if your graph is sparse. For dense (near-complete) graphs, TensorPipe will perform even worse than Gloo, because each pipe internally uses multiple TCP connections, whereas Gloo uses only one.
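Rough port math for the 32 × 32 setup in this thread (the connections-per-pipe factor below is an assumption for illustration, not TensorPipe’s exact number):

```python
servers, procs_per_server = 32, 32
world = servers * procs_per_server               # 1024 processes total

# Gloo: one eager TCP connection per pair of processes, so each of the 32
# local processes connects to the other 1023, regardless of traffic.
gloo_ports_per_server = procs_per_server * (world - 1)

# TensorPipe: connections open on demand, but each pipe may use several
# TCP connections internally (this multiplier is illustrative).
conns_per_pipe = 2

def tensorpipe_ports_per_server(peers_used):
    return procs_per_server * peers_used * conns_per_pipe

sparse = tensorpipe_ports_per_server(2)          # e.g. client/server pattern
dense = tensorpipe_ports_per_server(world - 1)   # all-to-all: worse than Gloo
```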
The reason you’re currently unable to use TensorPipe is that it does indeed use Gloo internally for the join step. We’ve wanted to get rid of this for a while, but it’s hard: the RPC agent’s join must perform a barrier, and it’s easier to do that through a library that already implements collectives (namely Gloo) than to re-implement it. We could use the c10d::Store instead of the ProcessGroup for that, but currently the Store isn’t powerful enough. @osalpekar was thinking of refactoring it, though, so maybe then we could make this change. See https://github.com/pytorch/pytorch/issues/42879 and https://github.com/pytorch/pytorch/issues/41614 for more context.
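To illustrate the Store idea (a sketch only, not the actual refactor tracked in those issues): `Store.add()` is an atomic increment and `get()` lets every rank observe the counter, which is already enough for a one-shot join barrier.

```python
import os
import tempfile
import torch.distributed as dist

def store_barrier(store, world_size, key="rpc_join_barrier"):
    # Each rank atomically bumps the shared counter...
    store.add(key, 1)
    # ...then polls until all ranks have checked in. (A real implementation
    # would back off or block on the store rather than busy-poll.)
    while int(store.get(key)) < world_size:
        pass

# Single-process demonstration backed by a FileStore:
store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "store"), 1)
store_barrier(store, world_size=1)
```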