I am running RPC with 3 nodes. In my code, the master node successfully calls worker1's and worker2's forward functions and gets the results back. After that, the loss backpropagation step is executed on the master node, which takes quite some time. Because of that, I am getting the error below on the master node:
dist_autograd.backward(context_id, [losses])
RuntimeError: Error on Node 0: ETIMEDOUT: connection timed out
On the worker nodes I am getting the following output:
Failed to respond to 'Shutdown Proceed' in time, got error RPCErr:1:RPC ran for more than set timeout (5000 ms) and will now be marked with an error.
[W tensorpipe_agent.cpp:687] RPC agent for worker2 encountered error when sending outgoing request #92 to master: ETIMEDOUT: connection timed out
<above line many times>
Process Process-1:
[W tensorpipe_agent.cpp:545] RPC agent for worker2 encountered error when reading incoming request from master: ECONNRESET: connection reset by peer (this is expected to happen during shutdown)
EDIT:
StackTrace:
Process Process-1:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/pytorch2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/user/anaconda3/envs/pytorch2/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "main.py", line 70, in workers_init_rpc
rpc.shutdown()
File "/home/user/anaconda3/envs/pytorch2/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 78, in wrapper
return func(*args, **kwargs)
File "/home/user/anaconda3/envs/pytorch2/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 284, in shutdown
_get_current_rpc_agent().join()
RuntimeError: [/opt/conda/conda-bld/pytorch_1607370156314/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [192.168.13.205]:28380
Init RPC code:
import sys
import multiprocessing as mp
from torch.distributed import rpc

def workers_init_rpc(rank, world_size, options):
    # options = rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=128,
    #                                             rpc_timeout=0,
    #                                             init_method="tcp://192.168.13.46:2222")
    print(f'Rank {rank}: Proceed to init rpc')
    rpc.init_rpc(
        f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options
    )
    print(f'Rank: {rank}, rpc init done')
    if rank == 0:
        print('Proceed to run_master')
        run_master()
    # block until all rpcs finish
    rpc.shutdown()

if __name__ == "__main__":
    world_size = 3
    processes = []
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=128, rpc_timeout=10 * 60)
    rank = int(sys.argv[1])
    p = mp.Process(target=workers_init_rpc, args=(rank, world_size, options))
    p.start()
    processes.append(p)
    # mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
    for p in processes:
        p.join()
I have tried increasing the rpc_timeout parameter in TensorPipeRpcBackendOptions, but it is not working. How can I keep the connection open for longer durations?
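For reference, my understanding from the docs is that rpc_timeout is in seconds, that rpc_timeout=0 disables the timeout entirely, and that rpc_sync/rpc_async also accept a per-call timeout that overrides the agent-wide one. A minimal sketch of what I mean (the worker name and callee here are placeholders, not my actual code):

```python
from torch.distributed import rpc

# rpc_timeout=0 means "no timeout": RPCs wait indefinitely instead
# of being marked with an error after the default 60 s.
options = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=128,
    rpc_timeout=0,  # seconds; 0 disables the timeout
)
assert options.rpc_timeout == 0

# Individual calls can also override the agent-wide timeout, e.g.:
# rpc.rpc_sync("worker1", some_forward_fn, args=(x,), timeout=600)
```

Is this the right way to keep long-running backward passes from timing out, or is there a separate timeout that applies to dist_autograd.backward and shutdown?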