Send computation to a remote GPU

Is there a way to define a remote GPU device inside our local code?

For example:

local_cpu = torch.device('cpu')
remote_device = ... (?)

model = Model().to(remote_device)
...
inputs = inputs.to(remote_device)
outputs = model(inputs)
outputs = outputs.to(local_cpu)

Hey @xerxex, there is a torch.distributed.rpc package for this purpose. Please refer to the following docs:

  1. API doc: https://pytorch.org/docs/master/rpc.html
  2. Tutorial: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html

For now, we do not yet support creating a remote device like torch.device('worker1/cuda0'), but this is on our roadmap, and we plan to implement it as a sugar layer on top of RPC. Applications should be able to do the same thing using our raw RPC API.
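
In the meantime, something along these lines should work with the raw RPC API (a rough sketch, reusing Model and inputs from your snippet above; the worker names and device are illustrative):

import torch
import torch.distributed.rpc as rpc

_remote_model = None

def setup_model():
    # Runs on worker1: build the model on worker1's local GPU.
    global _remote_model
    _remote_model = Model().to('cuda:0')

def run_forward(inputs):
    # Runs on worker1: move inputs to the GPU, compute, and return CPU outputs.
    return _remote_model(inputs.to('cuda:0')).to('cpu')

# On worker0, after init_rpc:
rpc.rpc_sync("worker1", setup_model)
outputs = rpc.rpc_sync("worker1", run_forward, args=(inputs,))

Here the model lives entirely in worker1's process, and only CPU tensors cross the wire.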


Hi @mrshenli, thanks for the links. Unfortunately, I could not run the minimal example from the documentation. Could you please point out what I am doing wrong?

Let’s say I cannot proceed without the return value, so I need rpc_sync.
I created two Python scripts. The first one is:

# On worker 0:
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
rpc.shutdown()

and the second one is:

# On worker 1:
import torch.distributed.rpc as rpc
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()

Then, when I execute the first script, I run into this error:

  File "process_0.py", line 4, in <module>
    rpc.init_rpc("worker0", rank=0, world_size=2)
  File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 60, in init_rpc
    init_method, rank=rank, world_size=world_size
  File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 48, in rendezvous
    raise RuntimeError("`url` must be a string. {}: {}".format(type(url), url))
RuntimeError: `url` must be a string. <class 'NoneType'>: None

Have you set the master address and port for the Gloo ProcessGroup? Something like:

    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
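
Applied to your first script, that would look something like this (localhost assumes both workers run on the same machine, and 29500 is just an arbitrary free port):

# On worker 0:
import os
import torch
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
rpc.shutdown()

The second script needs the same two environment variables set before its init_rpc call as well.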

@mrshenli, I added those as you suggested, but I still get the same error.
The error comes from this line:

rpc.init_rpc("worker0", rank=0, world_size=2)

Hey @xerxex, which version of PyTorch are you using?

  File "/data/anaconda3/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 60, in init_rpc
    init_method, rank=rank, world_size=world_size

Given the line above, it does not seem to be v1.4 or the current master branch.

RPC is only available since v1.4.

@mrshenli, my version is 1.4.0a0+a5272cb

I see, that’s a commit prior to the official v1.4.0 release, and that is why the code in the error message looks different from v1.4.0.

In the version you are using (from Nov 22nd, 2019), the init_rpc API takes an init_method argument, which you need to set. It is the same init_method you would pass to init_process_group.
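
For example, something like this (a sketch; I have not verified the exact signature of that pre-release build):

rpc.init_rpc("worker0", init_method="tcp://localhost:29500", rank=0, world_size=2)

The tcp://host:port string is the same rendezvous URL format that init_process_group accepts.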

It will be easier if you switch to official v1.4 or the current master.

Honestly, we wouldn’t recommend using versions prior to v1.4.0; the API and behavior of the RPC package were only officially announced (as experimental) in v1.4.0. So even if you can get past init_rpc on your current PyTorch version by setting init_method, you might run into other issues later.