Hi, I have a use case where I want to train a model in which some layers live on a remote client machine and most of the layers live on a GPU server. I understand that the torch.distributed.rpc library is meant to help with this, but none of the tutorials I've found show how to connect to a remote host/port; all the examples are intra-machine. Has anyone tried something similar before, or can someone point me towards a tutorial or some code snippets?
You have to use torchrun to connect them. I had the same problem and asked about it as well (RPC + Torchrun hangs in ProcessGroupGloo). My question includes a minimal example and the torchrun command.
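For anyone landing here later, here is a rough sketch of what that kind of setup can look like. This is not the code from the linked question; the worker names, the example rpc_sync call, the script name, and the endpoint address below are all placeholders. It relies on torchrun exporting RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, which init_rpc picks up through its default env:// rendezvous.

```python
# split_model.py (hypothetical name) -- run via torchrun on both machines.
import os

import torch
import torch.distributed.rpc as rpc


def main():
    # torchrun sets these environment variables for every process it launches.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Each process joins the same RPC group. The rendezvous address comes from
    # MASTER_ADDR/MASTER_PORT, which torchrun derives from --rdzv-endpoint,
    # so the GPU server and the remote client find each other over TCP.
    rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # Example remote call: execute torch.add on the client machine
        # ("worker1"). In a real split model you would instead hold RRefs to
        # the client-side layers and call their forward pass remotely.
        out = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
        print(out)

    # Blocks until all outstanding RPC work is finished on every worker.
    rpc.shutdown()


if __name__ == "__main__":
    main()
```

You would then launch it with something like `torchrun --nnodes=2 --nproc-per-node=1 --node-rank=0 --rdzv-backend=c10d --rdzv-endpoint=<gpu-server-ip>:29500 split_model.py` on the server and the same command with `--node-rank=1` on the client (flag spelling varies slightly between torchrun versions, e.g. `--node_rank`).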
EDIT: Omg, I'm only now seeing that your question is from 2022