Ease development by running computations on a remote GPU

Hello!

For development I use a local machine with no GPU, and I have a remote machine with a GPU.

I like to debug my code with IDE tools, but I also want access to a GPU.

Using something like VS Code over SSH is kinda slow, so I want to run my scripts locally but send some computations to the remote machine.

Ideal variant

# pytorch will connect to remote machine and start a process for GPU computation there
rpc_init("server_addr")

# all computations with model.parameters() will automagically execute on remote machine
model = Linear(3, 1).to("remote-gpu")

data = [
    (Tensor([1, 2, 3]), 1),  # may call .to("remote-gpu") as well 
    (Tensor([4, 5, 6]), 2),  # not too bad
]

# data will be automagically sent to remote machine inside model.__call__()
# or it is already there if used Tensor.to("remote-gpu")
for sample, label in data:
    result = model(sample)
    loss = compute_loss(label, result)
    # backward and the optimizer step are done on the remote machine as well
    loss.backward()
    optimizer.step()

So I would run python script.py on the local machine and use my local debugging tools; all the code would run locally, except that somewhere deep inside, tensor operations would make RPC calls to the remote GPU to do the computation, and then execution would continue on my machine again.

Is there an easy API in torch.distributed.rpc to achieve this? If not, how can I achieve this with the current API?

Hi Lain!

May I recommend that you run “emacs” on your remote machine? If you
don’t have an X Server on your local machine (or you feel that X Display
is too slow), you can run “emacs -nw” in a terminal window.

Best.

K. Frank


There are two options:

  1. A higher-level API RemoteModule (recommended):
    Distributed RPC Framework — PyTorch master documentation

You don’t need to write any RPC explicitly. Instead, you need to override the forward method. When you construct this nn.Module-like module, you need to explicitly specify the device placement (a remote GPU in your case). The input tensors will then automatically be placed on the same CUDA device. (A minimal usage sketch follows the two options below.)

Another example can be found here: Combining Distributed DataParallel with Distributed RPC Framework — PyTorch Tutorials 1.8.1+cu102 documentation

  2. A lower-level API RRef:
    Getting Started with Distributed RPC Framework — PyTorch Tutorials 1.8.1+cu102 documentation

You need to write your own RPC, and call to_here() to run your remote module in the RPC.
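To make the first option more concrete, here is a minimal sketch of what the RemoteModule route could look like, assuming a two-process group where the local process is named "trainer" and the GPU machine runs a worker named "gpu_worker" (both names are illustrative), and assuming MASTER_ADDR/MASTER_PORT are set so the two processes can rendezvous:

# local process ("trainer"), run on the machine without a GPU
import torch
import torch.distributed.rpc as rpc
from torch.distributed.nn import RemoteModule

rpc.init_rpc("trainer", rank=0, world_size=2)

# the Linear module is constructed and kept on the remote worker's GPU
remote_linear = RemoteModule("gpu_worker/cuda:0", torch.nn.Linear, args=(3, 1))

# the input is sent over RPC, forward() runs on the remote GPU,
# and the result is returned to the local process
# (if the output stays on the remote GPU, a device map or moving the
# output to CPU inside the module may additionally be needed)
out = remote_linear.forward(torch.randn(2, 3))

# RRefs to the remote parameters, e.g. for a DistributedOptimizer
param_rrefs = remote_linear.remote_parameters()

rpc.shutdown()

The matching process on the GPU machine only needs to call rpc.init_rpc("gpu_worker", rank=1, world_size=2) followed by rpc.shutdown(); a sketch of that side appears further down the thread.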


@KFrank, thanks for answering! But… I don’t like this solution for several reasons:

  1. I usually use neovim as an IDE, and I tried running it on the remote machine and connecting via SSH. It was unbearably slow. Maybe upgrading my local connection speed would resolve this problem; I should try.

  2. This way I have to either sync my whole dev environment between the two machines or migrate to the remote machine entirely. That's not a convenient solution, because I don't want to store my personal project code on the remote machine.

Also, I don't think a TUI IDE over SSH is much better than VS Code over SSH. Maybe something like mosh could help, since it is more terminal-application-friendly, but the last time I used it, it messed up syntax highlighting :cry:

@wayi Thanks for answering! I guess your first option is what I need but I have several questions about it.

  1. Documentation on RemoteModule says RemoteModule is not currently supported when using CUDA tensors, but you said tensors will be automatically placed to the same cuda device. Am I missing something? If CUDA tensors are not supported now, where can I track progress on this?

  2. I still need to spawn a remote pytorch process manually every time I start my local process, right? Is there a solution to create a long-living remote process that will consume messages from different local processes?

If not, I can automate my local build process to do something like

ssh remote-host 'cd proj-dir; python remote-worker.py' &
python train.py

It’s not very elegant but should work.

  1. Documentation on RemoteModule says RemoteModule is not currently supported when using CUDA tensors, but you said tensors will be automatically placed to the same cuda device. Am I missing something? If CUDA tensors are not supported now, where can I track progress on this?

Thanks for pointing this out! The doc is outdated. Actually, CUDA tensors are now supported on the TensorPipe backend, as documented on the same page. I will update the doc soon.
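For reference, the TensorPipe docs describe enabling CUDA tensors over RPC via a device map on the backend options. A minimal sketch, using the same illustrative worker names as above and assuming both sides expose a cuda:0 device (with a CPU-only caller you would instead rely on the remote module moving the inputs onto its own device, as described above):

import torch.distributed.rpc as rpc
from torch.distributed.rpc import TensorPipeRpcBackendOptions

options = TensorPipeRpcBackendOptions(num_worker_threads=8)
# map local cuda:0 to cuda:0 on "gpu_worker", so CUDA tensors can travel
# directly inside RPC calls (worker name is illustrative)
options.set_device_map("gpu_worker", {"cuda:0": "cuda:0"})

rpc.init_rpc("trainer", rank=0, world_size=2, rpc_backend_options=options)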

  1. I still need to spawn a remote pytorch process manually every time I start my local process, right? Is there a solution to create a long-living remote process that will consume messages from different local processes?

You have to start both the local process(es) and the remote workers together every time. This is because a static process group needs to be built at the very beginning, and the remote module(s) will be destroyed once the process group is gone.
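In practice, the remote-worker.py from the launch snippet above would be little more than the following sketch (file, host, and worker names are illustrative): it joins the group, idles while the trainer creates and drives remote modules on it, and is torn down together with the trainer.

# remote-worker.py (illustrative), started on the GPU machine for each run
import os
import torch.distributed.rpc as rpc

# MASTER_ADDR/MASTER_PORT must point at the rank-0 ("trainer") host
# and be identical in both processes
os.environ.setdefault("MASTER_ADDR", "trainer-host")  # illustrative host name
os.environ.setdefault("MASTER_PORT", "29500")

rpc.init_rpc("gpu_worker", rank=1, world_size=2)

# no work is scheduled here; rpc.shutdown() blocks until the trainer
# also shuts down, and then the process group is destroyed
rpc.shutdown()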

What you are asking for is more like treating the remote module as a server that a local process can connect to whenever it needs to offload some work. This can cause a problem: if multiple local processes offload work to the same remote worker, it will slow down the training.

The RPC framework usually works the opposite way: the local process can be viewed as a master process, and you distribute different modules to different remote workers. Note that a remote module does not really have to be allocated on another machine; it can be on a different device of the same machine. The idea of model parallelism is to distribute different subsets of a module across different devices, which can be on the same machine or on different machines. As a user, you shouldn't feel any difference in usage, though.


ssh remote-host 'cd proj-dir; python remote-worker.py' &
python train.py

I am not sure this will work in your environment. You still need to make sure the different hosts are connected, so you probably need to use something like SLURM to deploy multi-node training.


Thanks for your answer! Now I'm starting to see why it's not that easy to cover my use case with the current framework.

In my case I will have only one “main” and one “remote” worker, so it's not hard. But for general RPC it's not a good fit.

The RPC framework is mainly used for model parallelism. It seems that your use case is quite different from that purpose.

Update: We have an ongoing project called elastic RPC, which should be able to work for your use case.