Hi, I want to do exactly what is outlined in the model parallelism section of this paper: https://arxiv.org/pdf/1903.11314.pdf
“3.2.2 Model Parallelism. In model parallelism, the DL model is split, and each worker loads a
different part of the DL model for training (see Figure 5). The worker(s) that hold the input layer of
the DL model are fed with the training data. In the forward pass, they compute their output signal
which is propagated to the workers that hold the next layer of the DL model. In the backpropagation
pass, gradients are computed starting at the workers that hold the output layer of the DL model,
propagating to the workers that hold the input layers of the DL model.”
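To check my understanding of that description, here is how I picture the split in plain PyTorch on a single machine (both "workers" in one process, no networking yet). This is just a minimal sketch to illustrate where the activations and gradients flow; the layer sizes are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# "Worker 0" holds the input layer, "worker 1" holds the rest of the model.
# Both halves live in one process here purely to illustrate the flow.
input_part = nn.Linear(10, 20)                            # held by the worker with the data
rest_part = nn.Sequential(nn.ReLU(), nn.Linear(20, 2))    # held by the other worker

x = torch.randn(4, 10)                # training data (only worker 0 would see this)
labels = torch.randint(0, 2, (4,))    # labels, needed wherever the loss is computed

activations = input_part(x)           # forward pass on worker 0
output = rest_part(activations)       # forward pass on worker 1 (activations passed along)
loss = F.cross_entropy(output, labels)
loss.backward()                       # gradients flow back through both parts
```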
How would I implement the following “toy” example -
The “input layer” as described in the paper will be on my laptop, and only my laptop will contain the training data.
The rest of the layers will be on AWS. Apart from the training labels, I do not want to send any training data to my AWS instance in this toy example.
Has someone done this? Is there any code someone can point me to? I understand I will have to use torch.distributed's modules and RPC, but is there a reference I can look at?
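For reference, here is the rough shape of what I was imagining with torch.distributed.rpc. This is only a minimal sketch I have not tested across two machines; the worker names ("laptop", "aws"), the address/port, and the tiny linear layers are placeholders I made up:

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd
from torch.distributed.nn import RemoteModule
from torch.distributed.optim import DistributedOptimizer

# Both machines must be able to reach MASTER_ADDR:MASTER_PORT (rank 0).
# I am glossing over NAT/firewall setup here; these values are placeholders.
os.environ["MASTER_ADDR"] = "x.x.x.x"
os.environ["MASTER_PORT"] = "29500"

def run_laptop():
    rpc.init_rpc("laptop", rank=0, world_size=2)

    # The input layer stays local, so the raw training data never leaves the laptop.
    input_layer = nn.Linear(10, 20)

    # The rest of the model is constructed and held on the AWS worker.
    rest = RemoteModule("aws/cpu", nn.Linear, args=(20, 2))

    # Optimizer over both local and remote parameters (as RRefs).
    params = [rpc.RRef(p) for p in input_layer.parameters()] + rest.remote_parameters()
    opt = DistributedOptimizer(torch.optim.SGD, params, lr=0.05)

    x = torch.randn(4, 10)              # training data, only ever on the laptop
    labels = torch.randint(0, 2, (4,))  # in this sketch the loss is computed locally,
                                        # so even the labels stay on the laptop

    with dist_autograd.context() as context_id:
        out = rest(input_layer(x))      # only activations cross the wire, not raw data
        loss = F.cross_entropy(out, labels)
        dist_autograd.backward(context_id, [loss])
        opt.step(context_id)

    rpc.shutdown()

def run_aws():
    # The AWS side just joins the RPC group and serves requests until shutdown.
    rpc.init_rpc("aws", rank=1, world_size=2)
    rpc.shutdown()
```

My understanding is that the AWS instance would run run_aws() and passively serve the remote module, while the laptop runs run_laptop(), so only activations, gradients, and optimizer steps go over the network. Does that look like the right approach, or is there a better reference implementation?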