PyTorch RPC: maximum number of concurrent RPCs?

From what I can remember, the TensorPipe agent does indeed have only one thread pool, which is shared across many "purposes", so what you said is totally reasonable.
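If that pool is the bottleneck, you can make it larger when initializing RPC. A minimal sketch, assuming the TensorPipe backend with env-based rendezvous (the worker name, world size, and pool size of 64 are just placeholders; the default pool size is 16):

```python
import torch.distributed.rpc as rpc

# Give the TensorPipe agent a bigger shared thread pool so more RPCs
# (and their callbacks) can be serviced concurrently. 64 is only an
# example value; the default num_worker_threads is 16.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=64)

# Assumes MASTER_ADDR / MASTER_PORT are set in the environment
# (the default env:// init method).
rpc.init_rpc(
    name="worker0",      # placeholder worker name
    rank=0,
    world_size=2,
    rpc_backend_options=options,
)

# ... issue RPCs ...

rpc.shutdown()
```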

I don’t immediately recognize the initial error message you got, since it mentions retries, but I don’t think we support retries for "standard" RPC messages; I believe we only support them for RRef-related messages and other internal traffic. @mrshenli is that the case? @jeremysalwen are you using RRefs, dist autograd, or other such features? Unfortunately the message doesn’t say what the underlying failure is, but I suspect it could be a timeout?
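One quick way to test the timeout hypothesis is to raise the timeout, either per call or globally at init time. A rough sketch (the destination worker name is a placeholder, and the default per-RPC timeout is 60 seconds unless you changed it):

```python
import torch
import torch.distributed.rpc as rpc

# Per-call override: wait up to 600 seconds for this RPC instead of
# the default configured at init time.
ret = rpc.rpc_sync(
    "worker1",                       # placeholder destination worker
    torch.add,
    args=(torch.ones(2), torch.ones(2)),
    timeout=600,
)

# Or raise the default for every RPC when initializing:
# options = rpc.TensorPipeRpcBackendOptions(rpc_timeout=600)
```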

Also note that, as @mrshenli said, it’s an antipattern to block synchronously inside a remote RPC function or a callback. Doing so ties up a thread and eventually leads to starvation. If you must do it, make sure only a limited number of RPC calls block at once and size the thread pool accordingly. It would be better, though, to use "native" asynchronous patterns, e.g. something like the sketch below.
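For example, instead of calling `wait()` on a nested RPC inside the remote function, you can return a chained future with `@rpc.functions.async_execution`, so no thread-pool thread is parked while the nested call is in flight. A rough sketch (the worker names and the chained-add example are placeholders, not your actual workload):

```python
import torch
import torch.distributed.rpc as rpc


@rpc.functions.async_execution
def add_then_offset(to, x, y, z):
    # Returns a Future immediately; the thread-pool thread is released
    # while the nested RPC to `to` is in flight. The lambda runs as a
    # callback once the nested result is ready.
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )


# Caller side: this still looks like an ordinary RPC, and it is fine
# to wait here because the caller is not inside an RPC handler.
# result = rpc.rpc_sync(
#     "worker1", add_then_offset,
#     args=("worker2", torch.ones(2), torch.ones(2), 1),
# )
```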