Using torch RPC to connect to a remote machine

Hi, I have a use case where I want to train a model with some layers on a remote client machine and most of the layers on a GPU server. I understand that the torch.distributed.rpc library is meant to help with this, but none of the tutorials I have found show how to connect to a remote host/port; all the examples are intra-machine. Has anyone tried something similar before, or can someone point me towards a tutorial or code snippets?


You can use torchrun to connect them. I had the same problem and also asked about it (RPC + Torchrun hangs in ProcessGroupGloo). My question includes a minimal example and the torchrun command.
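For anyone landing here, this is a rough sketch of what a two-machine setup could look like. It is not the example from my other thread. The file name `rpc_demo.py`, the `worker{rank}` naming, the `rpc_settings` helper, the port and the `<server-ip>` placeholder are all assumptions for illustration. It relies on torchrun exporting `RANK` and `WORLD_SIZE` to each process, and on `init_rpc` using its default `env://` rendezvous.

```python
# rpc_demo.py (hypothetical file name)
# Launch sketch, one process per machine; <server-ip> and the port are placeholders:
#   server: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
#               --master_addr=<server-ip> --master_port=29500 rpc_demo.py
#   client: torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
#               --master_addr=<server-ip> --master_port=29500 rpc_demo.py
import os


def rpc_settings(env=os.environ):
    """Derive the RPC worker name, rank and world size from torchrun's env vars."""
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    # "worker{rank}" is an arbitrary naming scheme chosen for this sketch.
    return f"worker{rank}", rank, world_size


def main():
    # Imported here so the helper above can be used without torch installed.
    import torch
    import torch.distributed.rpc as rpc

    name, rank, world_size = rpc_settings()
    # With default backend options, init_rpc rendezvouses through env://,
    # i.e. the MASTER_ADDR / MASTER_PORT values set up by torchrun.
    rpc.init_rpc(name, rank=rank, world_size=world_size)

    if rank == 0:
        # Run torch.add on the other machine and wait for the result.
        result = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1))
        print(result)

    # Blocks until all workers are done, then tears down RPC.
    rpc.shutdown()


# Only start RPC when launched through torchrun (which sets RANK).
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Both machines need to be able to reach `<server-ip>` on the chosen port, and the same script has to be started on each side with its own `--node_rank`.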

EDIT: Omg, I'm seeing now your question was from 2022 :joy: