For development I use a local machine with no GPU, plus a remote machine that has one.
I like to debug my code with my IDE tools, but I also want access to the GPU.
Something like VS Code over SSH is rather slow, so I want to run my scripts locally and only send some computations to the remote machine.
Ideal variant
# pytorch will connect to the remote machine and start a process for GPU computation there
rpc_init("server_addr")
# all computations with model.parameters() will automagically execute on the remote machine
model = Linear(3, 1).to("remote-gpu")
data = [
    (Tensor([1, 2, 3]), 1),  # may call .to("remote-gpu") as well
    (Tensor([4, 5, 6]), 2),  # not too bad
]
# data will be automagically sent to the remote machine inside model.__call__()
# or it is already there if Tensor.to("remote-gpu") was used
for (sample, label) in data:
    result = model(sample)
    loss = compute_loss(label, result)
    # this is done on the remote machine as well
    optimizer.step()
So I would run python script.py on the local machine and use my local debugging tools; all the code would run locally, except that somewhere deep down the tensor operations would make RPC calls to the remote GPU, and then execution would return to my machine.
Is there an easy API in torch.distributed.rpc to achieve this? If it's not easy, how can I achieve this with the current API?
May I recommend that you run "emacs" on your remote machine? If you don't have an X Server on your local machine (or you feel that X Display is too slow), you can run "emacs -nw" in a terminal window.
You don't need to explicitly write any RPC. Instead, you need to override the forward method. When you construct this nn.Module-like module, you need to explicitly specify the device placement (a remote GPU in your case). The input tensors will be automatically placed on the same CUDA device.
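Concretely, this is what `RemoteModule` (from `torch.distributed.nn`) looks like. As a self-contained sketch it again uses a single in-process worker and CPU placement; with a real remote GPU you would start a worker on that machine and write something like `"remote_worker/cuda:0"` as the placement string (worker names here are hypothetical).

```python
import os
import torch
from torch import nn
import torch.distributed.rpc as rpc
from torch.distributed.nn import RemoteModule

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29502")

# Single worker in-process for illustration.
rpc.init_rpc("worker0", rank=0, world_size=1)

# The Linear module is constructed on the named worker/device, not locally;
# locally you only hold a handle to it.
remote_linear = RemoteModule("worker0/cpu", nn.Linear, args=(3, 1))

# Input tensors are shipped over RPC to the owning worker automatically,
# and the output comes back as a local tensor.
out = remote_linear.forward(torch.randn(5, 3))
print(out.shape)  # torch.Size([5, 1])

rpc.shutdown()
```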
@KFrank, thanks for answering! But… I don't like this solution, for several reasons:
I usually use neovim as an IDE. I tried running it on the remote machine and connecting via SSH, and it was unbearably slow. Maybe upgrading my local connection speed would resolve this; I should try.
That way I have to either sync my whole dev environment between the two machines or migrate to the remote machine fully. It's not a convenient solution, because I don't want to store my personal projects' code on the remote machine.
Also, I don't think a TUI IDE over SSH is much better than VS Code over SSH. Maybe something like mosh could help, since it is friendlier to terminal applications, but the last time I used it, it messed up syntax highlighting.
@wayi Thanks for answering! I guess your first option is what I need but I have several questions about it.
The documentation on RemoteModule says "RemoteModule is not currently supported when using CUDA tensors", but you said tensors will be automatically placed on the same CUDA device. Am I missing something? If CUDA tensors are not supported right now, where can I track progress on this?
I still need to spawn a remote pytorch process manually every time I start my local process, right? Is there a solution to create a long-living remote process that will consume messages from different local processes?
If not, I can automate my local build process to do something like
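(The original post trails off here; a launcher along these lines is one way to picture the automation. Host name, user, ports, and script names below are all hypothetical.)

```shell
#!/usr/bin/env bash
# Hypothetical launcher: start the RPC worker on the GPU box over SSH,
# then run the local script as the master, then wait for the worker.
REMOTE=user@gpu-box                                  # hypothetical host
ssh "$REMOTE" 'python remote_worker.py' &            # rank 1; blocks until rpc.shutdown()
WORKER_PID=$!
MASTER_ADDR=gpu-box MASTER_PORT=29500 python script.py   # rank 0, local process
wait "$WORKER_PID"
```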
> Documentation on RemoteModule says RemoteModule is not currently supported when using CUDA tensors, but you said tensors will be automatically placed to the same cuda device. Am I missing something? If CUDA tensors are not supported now, where can I track progress on this?
Thanks for pointing this out! The doc is outdated. CUDA tensors are actually now supported by the TensorPipe backend, as documented on the same page. I will update the doc soon.
> I still need to spawn a remote pytorch process manually every time I start my local process, right? Is there a solution to create a long-living remote process that will consume messages from different local processes?
You have to initiate both the local process(es) and the remote workers together every time. This is because a static process group needs to be built at the very beginning, and the remote module(s) will be destroyed once the process group is gone.
What you are asking for is more like treating the remote module as a server, where a local process can connect to that server whenever it needs to offload some work. This can cause a problem: if multiple local processes offload work to the same remote worker, it will slow down the training.
An RPC framework usually works the opposite way: the local process can be viewed as a master process, and you try to distribute different modules to different remote workers. Note that a remote module does not really have to be allocated to another machine; it can be on a different device of the same machine. The model-parallelism idea is to distribute different subsets of a module to different devices, which can be on the same machine or on different machines. As a user, you shouldn't feel any difference in usage, though.
I am not sure this will work in your environment. You still need to make sure the different hosts are connected, so you will probably need something like SLURM to deploy multi-node training.