Hi, I have a use case where I want to train a model in which some layers live on a remote client machine and most of the layers live on a GPU server. I understand that the torch.distributed.rpc library is meant to help with this, but none of the tutorials I've found show how to connect to a remote host/port; all the examples are intra-machine. Has anyone tried something similar before, or can someone point me towards a tutorial or some code snippets?
You have to use torchrun to connect them. I had the same problem and asked about it as well (RPC + Torchrun hangs in ProcessGroupGloo). My question includes a minimal example and the torchrun command.
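For anyone landing here later, here is a rough sketch of what that kind of setup can look like. This is not the code from the linked question; the worker names, the example rpc_sync call, the script name, and the endpoint address below are all placeholders. It relies on torchrun exporting RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, which init_rpc picks up through its default env:// rendezvous.

```python
# split_model.py (hypothetical name) -- run via torchrun on both machines.
import os

import torch
import torch.distributed.rpc as rpc


def main():
    # torchrun sets these environment variables for every process it launches.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Each process joins the same RPC group. The rendezvous address comes from
    # MASTER_ADDR/MASTER_PORT, which torchrun derives from --rdzv-endpoint,
    # so the GPU server and the remote client find each other over TCP.
    rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # Example remote call: execute torch.add on the client machine
        # ("worker1"). In a real split model you would instead hold RRefs to
        # the client-side layers and call their forward pass remotely.
        out = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
        print(out)

    # Blocks until all outstanding RPC work is finished on every worker.
    rpc.shutdown()


if __name__ == "__main__":
    main()
```

You would then launch it with something like `torchrun --nnodes=2 --nproc-per-node=1 --node-rank=0 --rdzv-backend=c10d --rdzv-endpoint=<gpu-server-ip>:29500 split_model.py` on the server and the same command with `--node-rank=1` on the client (flag spelling varies slightly between torchrun versions, e.g. `--node_rank`).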
EDIT: Omg, I'm only now seeing that your question is from 2022