I am using the torch.distributed.rpc library, and the RPC agent crashes when I make no RPC calls for 30 minutes.
Is this expected behavior of the RPC library? I assume it crashes because it thinks something went wrong when there is no RPC traffic for 30 minutes; however, it times out and crashes even when the silence is intentional, simply because my code was written that way!
Below is a code snippet that reproduces the problematic behavior. If waiting 30 minutes is too long for testing, you can change the hardcoded timeout in torch/distributed/rpc/backend_registry.py (detailed in the comments in the snippet).
import torch.distributed.rpc as rpc
from torch.multiprocessing import Process
import os
import time

def do_nothing():
    pass

def test(rank, size):
    rpc.init_rpc("Rank" + str(rank), rank=rank, world_size=size)
    print("Rank %s rpc init" % rank)

    i = 0
    # To test easily, I changed <PATH_TO_TORCH_LIB>/torch/distributed/rpc/backend_registry.py.
    # Under def _process_group_init_backend_handler(), I changed the line
    # >> process_group_timeout = rpc_constants.DEFAULT_PROCESS_GROUP_TIMEOUT
    # (which makes the timeout 30 minutes) to a shorter value, e.g.,
    # >> process_group_timeout = datetime.timedelta(seconds=10)
    # Otherwise, the problem still occurs if I wait for 30 minutes.

    # Loop that does not do anything for a long time...
    while i < 10:
        time.sleep(1)
        print("Rank %s: %s sec passed..." % (rank, i))
        # Uncommenting the two lines below makes the crash go away,
        # i.e., generating some RPC traffic:
        #target = rank ^ 0x1
        #rpc.rpc_sync("Rank" + str(target), do_nothing)
        i += 1

    rpc.shutdown()
    print("Rank %s rpc shutdown" % rank)

if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = "localhost"
    os.environ['MASTER_PORT'] = "29502"
    processes = []
    for rank in [0, 1]:
        p = Process(target=test, args=(rank, 2))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
The error message is:
[E process_group_agent.cpp:664] Encountered exception in ProcessGroupAgent::listenLoop(): [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 10000ms for recv operation to complete on worker 1. This means that the RPC agent is in an unhealthy state and unusable.
I wonder whether this is a bug, expected behavior, or incorrect use of the API on my part. If it is expected behavior, is there any workaround?
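One workaround I am considering is a keepalive thread that periodically issues a no-op RPC so the agent never sees 30 minutes of silence. This is only a sketch: start_keepalive() and the 300-second interval are my own invention, and any interval shorter than the process group timeout should do.

import threading
import torch.distributed.rpc as rpc

def _noop():
    pass

def start_keepalive(peer_name, stop_event, interval_sec=300):
    # Issue a no-op RPC every interval_sec seconds so there is never a
    # 30-minute window without RPC traffic. stop_event lets the caller
    # stop the thread cleanly before calling rpc.shutdown().
    def _loop():
        while not stop_event.wait(interval_sec):
            rpc.rpc_sync(peer_name, _noop)
    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t

In test() above this would amount to stop = threading.Event() and start_keepalive("Rank" + str(rank ^ 0x1), stop) right after init_rpc(), then stop.set() just before rpc.shutdown(). It feels like a hack, though, so I would prefer a proper fix.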
I am mainly running into this because my code has a process that never makes RPC calls and instead only calls collectives under torch.distributed, such as distributed.all_reduce().
I first tried not initializing RPC at all in those processes and calling distributed.init_process_group() directly instead.
However, this made the rpc.init_rpc() calls in the other processes hang or crash. I suspect this happens because rpc.init_rpc() calls distributed.init_process_group() internally, and the two don't play well together when some processes reach init_process_group() via init_rpc() while others call it directly...
If RPC timing out after 30 minutes is expected behavior, maybe I need to find a way for some processes to call rpc.init_rpc() and others to call distributed.init_process_group() without failures, perhaps along the lines of the sketch below.
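For example, I imagined every rank initializing both, with separate rendezvous addresses so the two initializations cannot collide. Again just a sketch: the init_both() helper and port 29503 are made up, and I have not verified that init_rpc() tolerates a subsequent init_process_group() call, since it may already initialize the default process group internally.

import torch.distributed as dist
import torch.distributed.rpc as rpc

def init_both(rank, size):
    # RPC rendezvous uses MASTER_ADDR / MASTER_PORT (port 29502 above).
    rpc.init_rpc("Rank" + str(rank), rank=rank, world_size=size)

    # Separate rendezvous for the collective process group on a different
    # port. Every rank must call this, because init_process_group() blocks
    # until all world_size ranks have joined.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29503",  # arbitrary port != MASTER_PORT
        rank=rank,
        world_size=size,
    )

Would something like this be the recommended way to mix RPC with torch.distributed collectives, or is there a better pattern?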
Thank you in advance.