Ease development by running computations on remote GPU

  1. The documentation on RemoteModule says RemoteModule is not currently supported when using CUDA tensors, but you said tensors will be automatically placed on the same CUDA device. Am I missing something? If CUDA tensors are not supported yet, where can I track progress on this?

Thanks for pointing this out! That doc is outdated: CUDA tensors are now supported by the TensorPipe backend, as documented elsewhere on the same page. I will update the doc soon.
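For reference, here is a minimal sketch of how CUDA support works with the TensorPipe backend: the caller declares a device map so CUDA tensors sent over RPC land on the right callee device. The worker names, port, and device mapping below are illustrative, and the `set_device_map` call is commented out so the sketch also runs on CPU-only machines.

```python
import os

import torch
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

options = rpc.TensorPipeRpcBackendOptions()
# On GPU machines, map the caller's cuda:0 to the callee's cuda:0 so
# CUDA tensors can be passed directly as RPC arguments (illustrative):
# options.set_device_map("worker1", {0: 0})

# A single-worker group just to keep the sketch self-contained.
rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=options)

# A loopback RPC; with a device map in place the same call could carry
# CUDA tensors. Here we stay on CPU so it runs anywhere.
result = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), torch.ones(2)))
print(result)  # tensor([2., 2.])

rpc.shutdown()
```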

  1. I still need to spawn a remote PyTorch process manually every time I start my local process, right? Is there a way to create a long-lived remote process that consumes messages from different local processes?

You have to launch the local process(es) and the remote workers together every time. This is because a static process group must be built at the very beginning, and the remote module(s) are destroyed once the process group is gone.
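As a sketch of that lifecycle (worker names, port, and layer sizes are assumptions for the example), the snippet below launches the master and the worker together as one static group. The RemoteModule handle is only valid while the group is alive, and both ranks must reach shutdown:

```python
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
from torch.distributed.nn import RemoteModule


def run(rank, world_size, queue):
    # Every member of the static process group rendezvous here.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"  # illustrative port
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # The master holds a handle; the Linear itself lives on worker1.
        remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(4, 2))
        out = remote_linear.forward(torch.randn(3, 4))
        queue.put(tuple(out.shape))
    # Tearing down the group also destroys the remote module.
    rpc.shutdown()


# Master and worker start together; neither outlives the group.
queue = mp.get_context("fork").SimpleQueue()
mp.start_processes(run, args=(2, queue), nprocs=2, join=True, start_method="fork")
shape = queue.get()
print(shape)  # (3, 2)
```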

What you are asking for is more like treating the remote module as a server, so that a local process can connect to that server whenever it needs to offload some work. This can cause a problem: if multiple local processes offload work to the same remote worker, it will slow down the training.

The RPC framework usually works the opposite way: the local process can be viewed as a master process that distributes different modules to different remote workers. Note that a remote module does not have to live on another machine; it can be on a different device of the same machine. The idea of model parallelism is to distribute different subsets of a module to different devices, which can be on the same machine or on different machines. As a user, you shouldn't feel any difference in usage.
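A sketch of that master-driven pattern (worker names, port, and layer sizes are made up for illustration): rank 0 places two toy pipeline stages on two workers and calls them as if they were local modules. On a single multi-GPU machine, the remote devices could just as well be `cuda:0` and `cuda:1` instead of `cpu`.

```python
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
from torch.distributed.nn import RemoteModule


def run(rank, world_size, queue):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29502"  # illustrative port
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # The master distributes two stages to two workers; "/cpu" could
        # be "/cuda:0" and "/cuda:1" on a multi-GPU machine.
        stage1 = RemoteModule("worker1/cpu", torch.nn.Linear, args=(8, 4))
        stage2 = RemoteModule("worker2/cpu", torch.nn.Linear, args=(4, 2))
        x = torch.randn(5, 8)
        # Usage looks just like chaining local modules.
        out = stage2.forward(stage1.forward(x))
        queue.put(tuple(out.shape))
    rpc.shutdown()


queue = mp.get_context("fork").SimpleQueue()
mp.start_processes(run, args=(3, queue), nprocs=3, join=True, start_method="fork")
shape = queue.get()
print(shape)  # (5, 2)
```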
