For now, we do not yet support creating a remote device like torch.device('worker1/cuda0'), but this is on our roadmap, and we plan to implement it as a sugar layer on top of RPC. In the meantime, applications should be able to achieve the same thing using our raw RPC API.
Hi @mrshenli, thanks for the links. Unfortunately, I could not run the minimal example from the documentation. Could you please point out what I need to change to get it running?
Let’s say I cannot proceed without the return value, so I need rpc_sync.
I created two python scripts. The first one is:
# On worker 0:
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
rpc.shutdown()
and the second one is:
# On worker 1:
import torch.distributed.rpc as rpc
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()
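One thing worth checking: with the default init_method ("env://" in v1.4.0), both processes need MASTER_ADDR and MASTER_PORT in their environment before calling init_rpc, or the rendezvous cannot happen. A minimal sketch, assuming both workers run on the same machine and port 29500 is free (both values are placeholders; use your master node's address):

```python
import os

# Set these in BOTH worker processes before rpc.init_rpc; the default
# "env://" init_method reads the rendezvous address from the environment.
# "localhost" and "29500" are assumptions for this sketch.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

Alternatively, export the two variables in the shell before launching each script.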
Then, when I execute the first script, I run into this error:
File "process_0.py", line 4, in <module>
rpc.init_rpc("worker0", rank=0, world_size=2)
File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 60, in init_rpc
init_method, rank=rank, world_size=world_size
File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 48, in rendezvous
raise RuntimeError("`url` must be a string. {}: {}".format(type(url), url))
RuntimeError: `url` must be a string. <class 'NoneType'>: None
Hey @xerxex, which version of PyTorch are you using?
File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 60, in init_rpc
init_method, rank=rank, world_size=world_size
Given the lines above, it does not seem to be v1.4 or the current master branch.
I see, that’s a commit prior to the official v1.4.0 release, and that is why the code in the error message looks different from v1.4.0.
In the version you are using (from Nov 22nd, 2019), the init_rpc API takes an init_method argument, which you need to set. It is the same init_method that you would pass to init_process_group.
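On that older snapshot, the call would look something like the sketch below. The tcp:// address and port are assumptions for illustration; substitute your master node's values:

```python
# Build an init_method URL, exactly as you would for init_process_group.
# "localhost" and 29500 are placeholder values for this sketch.
init_method = "tcp://{}:{}".format("localhost", 29500)

# On the pre-v1.4.0 snapshot, init_method must be passed explicitly
# (requires a live two-worker group, so the call is commented out here):
# rpc.init_rpc("worker0", rank=0, world_size=2, init_method=init_method)
print(init_method)
```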
It will be easier if you switch to the official v1.4 or the current master.
Honestly, we wouldn’t recommend using versions prior to v1.4.0; the RPC package’s API and behavior were only officially announced (as experimental) in v1.4.0. So even if you can get past init_rpc on your current PyTorch version by setting init_method, you might run into other issues later.