(Running on the latest PyTorch nightly.)
I am attempting to implement a distributed RL training setup with batched inference (similar to the "Implementing Batch RPC Processing Using Asynchronous Executions" PyTorch tutorial). I have a working setup with a small number of RPCs per process: 12 processes, with 15 "play_game" RPCs per process active at once.
However, when I attempt to increase the number of games played simultaneously by the worker processes (from 15 to 16 RPCs), the setup instead freezes, eventually outputting the error
[E thread_pool.cpp:112] Exception in thread pool task: The RPC has not succeeded after the specified number of max retries (5).
hundreds of times after several minutes.
The strange thing is that running 15 RPCs per process consistently succeeds, while running 16 RPCs per process consistently fails. Is there a limit on the number of RPCs that can be in flight at once?
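One thing I noticed (and I may be misreading the docs): the TensorPipe backend's `TensorPipeRpcBackendOptions` defaults to `num_worker_threads=16`, which exactly matches the threshold where things break. If each in-flight "play_game" RPC holds a worker thread while blocking on a nested call, the pool could starve itself. Here is a plain-Python analogy of that failure mode (illustrative only, not my actual RPC code; the pool size and timeout are made up for the demo):

```python
import concurrent.futures as cf
import threading

# A fixed pool of 16 workers, like TensorPipe's default
# num_worker_threads, can starve itself when every in-flight task
# blocks waiting on work that itself needs a free worker thread.
POOL_SIZE = 16

def run(in_flight: int) -> bool:
    done = threading.Event()
    with cf.ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Each "game" blocks until a follow-up task sets the event
        # (or until the 2-second timeout expires).
        games = [pool.submit(done.wait, 2.0) for _ in range(in_flight)]
        # The follow-up task needs a free worker thread to run at all.
        pool.submit(done.set)
        return all(f.result() for f in games)

print(run(15))  # one worker left free, follow-up runs -> True
print(run(16))  # pool saturated, follow-up starves past the timeout -> False
```

If this is actually what is happening, I would expect raising `num_worker_threads` in the backend options passed to `rpc.init_rpc` to move the threshold, but I have not confirmed that yet.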
The test I am running is available on GitHub at JDBumgardner/stone_ground_hearth_battles, in stone_ground_hearth_battles/test_pytorch_distributed.py (master branch).