I am using the torch.distributed.rpc library, and the RPC agent crashes if I don't make any RPC calls for 30 minutes.
Is this expected behavior of the RPC library? My guess is that the agent assumes something went wrong when it sees no RPC traffic for 30 minutes; however, it times out and crashes even when the absence of traffic is intentional, because my code simply doesn't make any calls during that window!
Below is a code snippet that reproduces the problematic behavior. If waiting 30 minutes is too long for testing, you can change the hardcoded timeout in torch/distributed/rpc/backend_registry.py (details in the comments).
import torch.distributed.rpc as rpc
from torch.multiprocessing import Process
import os
import time

def do_nothing():
    pass

def test(rank, size):
    rpc.init_rpc("Rank" + str(rank), rank=rank, world_size=size)
    print("Rank %s rpc init" % rank)
    i = 0
    # To test easily, I changed <PATH_TO_TORCH_LIB>/torch/distributed/rpc/backend_registry.py.
    # Under def _process_group_init_backend_handler(), I changed the line
    # >> process_group_timeout = rpc_constants.DEFAULT_PROCESS_GROUP_TIMEOUT
    # (which makes the timeout 30 minutes) to a shorter value, e.g.,
    # >> process_group_timeout = datetime.timedelta(seconds=10)
    # Otherwise, the problem still occurs, but only after waiting 30 minutes.

    # Loop that does not do anything for a long time...
    while i < 10:
        time.sleep(1)
        print("Rank %s %s sec passed..." % (rank, i))
        # Uncommenting the two lines below makes the crash go away,
        # i.e., generating some RPC traffic keeps the agent alive.
        #target = rank ^ 0x1
        #rpc.rpc_sync("Rank" + str(target), do_nothing)
        i += 1
    rpc.shutdown()
    print("Rank %s rpc shutdown" % rank)

if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = "localhost"
    os.environ['MASTER_PORT'] = "29502"
    processes = []
    for rank in [0, 1]:
        p = Process(target=test, args=(rank, 2))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
The error message is:
[E process_group_agent.cpp:664] Encountered exception in ProcessGroupAgent::listenLoop(): [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 10000ms for recv operation to complete on worker 1. This means that the RPC agent is in an unhealthy state and unusable.
I wonder whether this is a bug, expected behavior, or incorrect use of the API on my part. If it is expected behavior, is there a workaround?
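Based on the observation in the snippet above that generating RPC traffic makes the crash go away, the only mitigation I can think of is a background heartbeat thread that pings a peer more often than the timeout fires. Here is a minimal sketch; the helper names start_heartbeat and do_nothing are mine, not part of the API, and interval_sec is an arbitrary assumption that just needs to stay well below the 30-minute default:

import threading
import torch.distributed.rpc as rpc

def do_nothing():
    pass

def start_heartbeat(peer_name, interval_sec=60):
    # Periodically issue a trivial RPC so the agent never goes a full
    # timeout window without seeing traffic. interval_sec is an
    # assumption; anything comfortably below the process group timeout
    # should do.
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            rpc.rpc_sync(peer_name, do_nothing)
            stop_event.wait(interval_sec)

    threading.Thread(target=loop, daemon=True).start()
    return stop_event

In test() above I would call stop = start_heartbeat("Rank" + str(rank ^ 0x1)) right after rpc.init_rpc(), and stop.set() just before rpc.shutdown(), so no heartbeat RPC is in flight during shutdown. This is just the commented-out rpc_sync trick from the snippet moved into a thread; it avoids the crash but feels like a hack, hence my question about the intended behavior.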
I am mainly running into this because my code has a process that makes no RPC calls at all and instead only calls functions under torch.distributed, such as distributed.all_reduce().
I first tried not initializing RPC at all in those processes and instead calling distributed.init_process_group() directly.
However, this made the rpc.init_rpc() calls in the other processes hang or crash. I suspect this happens because rpc.init_rpc() calls distributed.init_process_group() internally, and the two somehow don't play well together when some processes reach init_process_group() via init_rpc() while others call it directly…
If RPC timing out after 30 minutes is expected behavior, maybe I need to find a way to have some processes call rpc.init_rpc() and others call distributed.init_process_group() without failure; a sketch of the attempt that currently fails for me is below.
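For completeness, here is a minimal sketch of the mixed initialization I tried, i.e., the variant that hangs or crashes for me (worker is just an illustrative name, and the env values are the same placeholders as in the snippet above):

import os
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc

def worker(rank, size):
    os.environ['MASTER_ADDR'] = "localhost"
    os.environ['MASTER_PORT'] = "29502"
    if rank == 0:
        # RPC-using process: init_rpc() appears to set up a process
        # group internally over the same rendezvous.
        rpc.init_rpc("Rank" + str(rank), rank=rank, world_size=size)
        rpc.shutdown()
    else:
        # Collective-only process: skip RPC and join the process group
        # directly. In my runs, mixing this with init_rpc() on the
        # other rank makes initialization hang or crash.
        dist.init_process_group(backend="gloo", rank=rank, world_size=size)
        t = torch.ones(1)
        dist.all_reduce(t)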
Thank you in advance.