When I use 1024 nodes with RPC, I get RuntimeError "listen: Address already in use"

rpc.init_rpc('env_{}'.format(rank), rank=rank, world_size=opt.world_size, rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(rpc_timeout=100))

If I use 128 nodes, it works.
But when I use 1024 nodes (32 servers * 32 processes or 16 servers * 64 processes), I get:

...
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:184] listen: Address already in use

My environment is PyTorch 1.6.0, Python 3.7, CUDA 10.1.
Has anyone run into this before?

Hey @yueyilia, I haven’t seen this error before. Which version of PyTorch are you using?

My environment is PyTorch 1.6.0, Python 3.7, CUDA 10.1.

Does the TensorPipe RPC backend work in this case? It is still experimental, but we plan to make it the default backend and focus on it in future releases.
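If you want to give it a try, something like the following should work as a drop-in for your original call (a minimal sketch; rank and opt.world_size are taken from your snippet, and the rest of your setup is assumed unchanged):

import torch.distributed.rpc as rpc

rpc.init_rpc(
    'env_{}'.format(rank),
    backend=rpc.BackendType.TENSORPIPE,  # experimental in 1.6
    rank=rank,
    world_size=opt.world_size,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(rpc_timeout=100),
)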

Regarding the error, I suspect this is a limitation in Gloo. I don’t have enough resources to verify this locally. If possible, could you check whether init_process_group and all_reduce work with 1024 nodes?
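Something along these lines should be enough for that check (a minimal sketch; it assumes MASTER_ADDR/MASTER_PORT are set in the environment and that rank/world_size match your RPC setup):

import torch
import torch.distributed as dist

def check_gloo(rank, world_size):
    # Join a Gloo process group; init_method='env://' reads MASTER_ADDR and
    # MASTER_PORT from the environment.
    dist.init_process_group(backend='gloo', init_method='env://',
                            rank=rank, world_size=world_size)
    t = torch.ones(1)
    dist.all_reduce(t)  # default op is SUM, so every rank should see world_size
    print('rank {}: all_reduce -> {}'.format(rank, t.item()))
    dist.destroy_process_group()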

I replaced the process group backend with TensorPipe, but I still get:

...
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:184] listen: Address already in use

It seems that TensorPipe still depends on Gloo.
I also ran init_process_group and all_reduce, and the error is the same.

It seems that the process group needs too many ports. It establishes a connection between every pair of processes, so each server needs about 32*1024 ports (32 servers * 32 processes) for its TCP connections. Do you have any plans to optimize this?

It seems that TensorPipe still depends on Gloo.

That’s true for now, but only for initialization and shutdown. We will remove the dependency on Gloo in future releases (hopefully in v1.8).

It seems that the process group needs too many ports. It establishes a connection between every pair of processes, so each server needs about 32*1024 ports (32 servers * 32 processes) for its TCP connections. Do you have any plans to optimize this?

Good catch! I am not aware of any plan to improve Gloo for this. cc @pbelevich

For the RPC framework, this seems to happen because Gloo creates a TCP connection for every pair of processes in the group.
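A quick back-of-the-envelope calculation for the 32 servers * 32 processes layout (just to illustrate the scaling, not an exact port count, since listening sockets and ephemeral port reuse also come into play):

world_size = 32 * 32                  # 1024 processes in total
procs_per_server = 32

conns_per_process = world_size - 1    # full mesh: one connection to every peer
conns_per_server = procs_per_server * conns_per_process
print(conns_per_server)               # 32736, i.e. roughly 32 * 1024 per server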

I’m wondering if this can be avoided in TensorPipe, where TCP connections are created on demand and kept in a pool for reuse. Typically in an RPC environment, we’re not talking to all the nodes in the group at the same time.

@yueyilia Could you add some details about your use case for RPC here? Are all 1024 nodes communicating with all the other nodes in your application at the same time? Is it possible to run 1 process per server in your application to get around this in the short term? If the GIL is currently the bottleneck, there is some TorchScript support in the RPC framework that might help get around it.
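For reference, calling a TorchScript function over RPC looks roughly like this (a sketch; the worker name 'env_0' just follows your naming scheme). Because the scripted function runs outside the Python interpreter on the callee, it doesn’t contend on the GIL there:

import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def scripted_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

# after rpc.init_rpc(...) has completed on all workers:
result = rpc.rpc_sync('env_0', scripted_add, args=(torch.ones(2), torch.ones(2)))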

cc: @lcw re creating TCP connections on demand in TensorPipe

TensorPipe would fare better if your topology (i.e., the links you actually use) is a significantly smaller subset of the complete graph, for example a server-client pattern. In other words, if your graph is sparse. For dense (near-complete) graphs, TensorPipe will perform even worse than Gloo, because each pipe internally uses multiple TCP connections, whereas Gloo uses only one.

The reason you’re currently unable to use TensorPipe is that, indeed, it uses Gloo internally for the join step. We’ve been wanting to get rid of this for a while, but it’s hard: the RPC agent’s join must do a barrier, and it’s easier to do that through a library that already provides collectives (namely Gloo) than to re-implement it. We could use the c10d::Store instead of the ProcessGroup for that, but currently the Store isn’t powerful enough. @osalpekar was thinking of refactoring it, though, so maybe then we could make this change. See https://github.com/pytorch/pytorch/issues/42879 and https://github.com/pytorch/pytorch/issues/41614 for more context.
