I ran the test code in https://pytorch.org/docs/1.13/rpc.html on two different machines and got the following errors:
worker0:
python torchrpc_master.py
[W tensorpipe_agent.cpp:530] RPC agent for worker0 encountered error when accepting incoming pipe: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
worker1:
python torchrpc_slave.py
[W tensorpipe_agent.cpp:940] RPC agent for worker1 encountered error when reading incoming response from worker0: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
Traceback (most recent call last):
  File "/code/cityeyes/torchrpc_slave.py", line 7, in <module>
    rpc.init_rpc("worker1", rank=1, world_size=2)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
    rpc_sync(
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/env/anaconda3/envs/nerf/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
    return fut.wait()
RuntimeError: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
However, the same code runs fine when both workers are started on the same machine. What causes the error when they run on two different machines?
This is my code:
torchrpc_master.py
import os
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
os.environ['MASTER_ADDR'] = '192.168.211.12'
os.environ['MASTER_PORT'] = '5678'
rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
print(ret)
rpc.shutdown()
torchrpc_slave.py
import os
import torch.distributed as dist
import torch.distributed.rpc as rpc
import time
os.environ['MASTER_ADDR'] = '192.168.211.12'
os.environ['MASTER_PORT'] = '5678'
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()
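Since every error above originates in tensorpipe/transport/ibv, one experiment that might help narrow this down is to restrict TensorPipe to its TCP transport so that the InfiniBand path is not used at all. The following is a minimal sketch of that experiment, not a confirmed fix. It assumes that `rpc.TensorPipeRpcBackendOptions` accepts the private `_transports` argument (underscore-prefixed, so it may change between releases) and that "uv" is the name of the TCP transport. The helper names `tcp_only_option_kwargs` and `init_worker_tcp_only` are hypothetical and only for illustration.

```python
def tcp_only_option_kwargs(rpc_timeout=60.0):
    """Keyword arguments for TensorPipeRpcBackendOptions that allow only TCP.

    Assumption: "_transports" is a private argument and "uv" is the
    TCP (libuv) transport name; neither is a documented, stable API.
    """
    return {"rpc_timeout": rpc_timeout, "_transports": ["uv"]}


def init_worker_tcp_only(name, rank, world_size=2):
    """Hypothetical replacement for the plain rpc.init_rpc(...) call.

    MASTER_ADDR and MASTER_PORT must already be set in os.environ,
    exactly as in the two scripts above.
    """
    # Imported here so the helper above can be inspected without torch.
    import torch.distributed.rpc as rpc

    options = rpc.TensorPipeRpcBackendOptions(**tcp_only_option_kwargs())
    rpc.init_rpc(
        name,
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options,
    )
    return rpc
```

Each script would call `init_worker_tcp_only("worker0", 0)` or `init_worker_tcp_only("worker1", 1)` in place of its `rpc.init_rpc(...)` line. If the error goes away with this change, that would suggest the problem lies in the InfiniBand setup between the two machines rather than in the scripts themselves.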
torch version: 1.13.1
Any help would be appreciated.