Does the TensorPipe RPC backend work in this case? It is still experimental, but we plan to make it the default backend and focus on it in future releases.
Regarding the error, I suspect this is a limitation in Gloo, but I don’t have enough resources to verify it locally. If possible, could you check whether init_process_group and all_reduce work with 1024 nodes?
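If it helps, here is a minimal sketch of that check. RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT would normally be set by your launcher; the defaults below only exist so a single-process run works:

```python
# Minimal smoke test: initialize the Gloo backend and run one all_reduce.
# Launch one copy of this script per process.
import os
import torch
import torch.distributed as dist

def main():
    # Defaults here are only for a single-process smoke run; a real launcher
    # (e.g. torchrun or a custom scheduler) sets these environment variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(
        backend="gloo",
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
    # Every rank contributes a tensor of ones; after the all_reduce each
    # rank should hold a tensor filled with world_size.
    t = torch.ones(4)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.eq(dist.get_world_size()).all()
    dist.destroy_process_group()
    return t

if __name__ == "__main__":
    main()
```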
It seems that the process group needs too many ports. It establishes a connection between every pair of processes, so each server needs 32 * 1024 ports for TCP connections (32 local processes, each connecting to the 1024 processes across 32 servers * 32 processes). Do you have any plans to optimize this?
That’s true for now, but Gloo is only used for initialization and shutdown. We will remove the dependency on Gloo in a future release (hopefully v1.8).
Good catch! I am not aware of any plan to improve Gloo for this. cc @pbelevich
For the RPC framework, this seems to happen because Gloo creates a TCP connection for every pair of processes in the group.
I’m wondering if this can be avoided in TensorPipe, where TCP connections are created on demand and kept in a pool for reuse. Typically in an RPC environment, we’re not talking to all the nodes in the group at the same time.
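For what it’s worth, the pooling idea can be sketched like this (a hypothetical illustration, not TensorPipe’s actual code or API):

```python
# Hypothetical sketch of on-demand connection pooling: a connection to a peer
# is opened only the first time we talk to it, then cached for reuse. With a
# sparse communication pattern, the number of open connections is bounded by
# the peers we actually contact, not by the world size.
class ConnectionPool:
    def __init__(self, dial):
        self._dial = dial   # callable that opens a new connection to a peer
        self._pool = {}     # peer -> cached open connection

    def get(self, peer):
        # Open lazily on first use; a real pool would also evict idle entries.
        if peer not in self._pool:
            self._pool[peer] = self._dial(peer)
        return self._pool[peer]

    def open_count(self):
        return len(self._pool)
```

With 1024 peers but a client/server communication pattern, `open_count()` stays small, unlike the all-pairs connections Gloo sets up eagerly.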
@yueyilia Could you add some details about your use case for RPC here? Are all the nodes (1024) communicating with all the other nodes in your application at the same time? Is it possible to run 1 process per server in your application to get around this in the short term? If the GIL is currently the bottleneck, there is some TorchScript support in the RPC framework that might help you get around the GIL.
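As a sketch of the TorchScript route (the worker name and the matmul workload below are placeholders): a `@torch.jit.script` function invoked over RPC runs in the TorchScript interpreter, outside the Python GIL.

```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def heavy_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Executes in the TorchScript interpreter, so it does not hold the GIL.
    return torch.matmul(a, b)

# On the caller, after rpc.init_rpc(...) has run on every worker:
# fut = rpc.rpc_async("worker1", heavy_matmul,
#                     args=(torch.ones(2, 2), torch.ones(2, 2)))
# result = fut.wait()
```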
TensorPipe would fare better if your topology (i.e., the links you actually use) is a significantly smaller subset of the complete graph, for example a server-client pattern. In other words, if your graph is sparse. For dense (near-complete) graphs, TensorPipe will perform even worse than Gloo, because each pipe internally uses multiple TCP connections, whereas Gloo uses only one.
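Rough port math for the 32 × 32 setup in this thread (the connections-per-pipe factor below is an assumption for illustration, not TensorPipe’s exact number):

```python
servers, procs_per_server = 32, 32
world = servers * procs_per_server               # 1024 processes total

# Gloo: one eager TCP connection per pair of processes, so each of the 32
# local processes connects to the other 1023, regardless of traffic.
gloo_ports_per_server = procs_per_server * (world - 1)

# TensorPipe: connections open on demand, but each pipe may use several
# TCP connections internally (this multiplier is illustrative).
conns_per_pipe = 2

def tensorpipe_ports_per_server(peers_used):
    return procs_per_server * peers_used * conns_per_pipe

sparse = tensorpipe_ports_per_server(2)          # e.g. client/server pattern
dense = tensorpipe_ports_per_server(world - 1)   # all-to-all: worse than Gloo
```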
The reason you’re currently unable to use TensorPipe is that it does indeed use Gloo internally for the join step. We’ve wanted to get rid of this for a while, but it’s hard: the RPC agent’s join must perform a barrier, and it’s easier to do that through a library that already implements collectives (namely Gloo) than to re-implement it. We could use the c10d::Store instead of the ProcessGroup for that, but currently the Store isn’t powerful enough. @osalpekar was thinking of refactoring it, though, so maybe then we could make this change. See https://github.com/pytorch/pytorch/issues/42879 and https://github.com/pytorch/pytorch/issues/41614 for more context.
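To illustrate the Store idea (a sketch only, not the actual refactor tracked in those issues): `Store.add()` is an atomic increment and `get()` lets every rank observe the counter, which is already enough for a one-shot join barrier.

```python
import os
import tempfile
import torch.distributed as dist

def store_barrier(store, world_size, key="rpc_join_barrier"):
    # Each rank atomically bumps the shared counter...
    store.add(key, 1)
    # ...then polls until all ranks have checked in. (A real implementation
    # would back off or block on the store rather than busy-poll.)
    while int(store.get(key)) < world_size:
        pass

# Single-process demonstration backed by a FileStore:
store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "store"), 1)
store_barrier(store, world_size=1)
```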