Hi,
I have 4 GPU devices. Say my network has 11 layers, split into {L0, L1, L2, L3} and {L4, L5, …, L10}. I want to do a hybrid of DDP and model parallelism.
- Each Li, for i = 0, 1, 2, 3, should go to GPU:i.
- A copy of {L4, …, L10} has to be on each GPU, so the network on GPU:i should be {Li, L4, …, L10}.
- In each pass, I want to gather the forward-pass outputs from all GPUs, compute a loss, backpropagate, and synchronize the gradients of the shared layers. The data passed through each GPU is the same, so this isn't "data parallel" in the usual sense; rather, a copy of one part of the model lives on every device (see the sketch after this list).
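
Here is a minimal sketch of what I have in mind, to make the setup concrete. The module names (`HybridModel`, `private`, `shared`), layer sizes, the MSE loss, and the dummy data are all placeholders, not my real code; the shared gradients are averaged by hand with `all_reduce`:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist


class HybridModel(nn.Module):
    """Per-rank model: one device-specific layer Li plus the shared stack {L4, ..., L10}."""

    def __init__(self, hidden=128):
        super().__init__()
        # Li: different parameters on every rank, never synchronized.
        self.private = nn.Linear(hidden, hidden)
        # {L4, ..., L10}: replicated on every rank, gradients kept in sync manually below.
        self.shared = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(7)])

    def forward(self, x):
        return self.shared(self.private(x))


def run(rank, world_size, hidden=128):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = HybridModel(hidden).cuda(rank)
    # Start the shared replicas from identical weights by broadcasting rank 0's copy.
    for p in model.shared.parameters():
        dist.broadcast(p.data, src=0)

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    torch.manual_seed(0)  # so every rank builds the same dummy batch
    x = torch.randn(32, hidden, device="cuda")       # same batch fed to every rank
    target = torch.randn(32, hidden, device="cuda")  # placeholder target

    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)  # placeholder per-rank loss
    loss.backward()

    # Average the gradients of the shared stack only; Li keeps its own, unsynced gradients.
    for p in model.shared.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    opt.step()
```

(To keep the sketch short I compute a per-rank loss; in my real setup the loss would be computed on the gathered outputs of all ranks, which I believe would need an autograd-aware gather such as `torch.distributed.nn.all_gather`.)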
Issue: In the standard distributed documentation and the model-parallelism demos, the network is always sliced sequentially across devices, e.g. {L0, L1} on GPU:0, {L2, L3} on GPU:1, and so on. In my case, part of the model is different on each device and part of it is shared.
Is there a straightforward way to implement this that I'm missing? And what difficulties might I run into while doing so?
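
For example, would something along these lines be a legitimate use of DDP: wrapping only the shared submodule, so DDP averages the gradients of {L4, …, L10} while Li stays outside the synchronization? (Rough sketch only, reusing the placeholder names from the snippet above.)

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_with_ddp(rank, world_size, hidden=128):
    # Same process-group setup as in the sketch above, then:
    model = HybridModel(hidden).cuda(rank)
    # Wrap only the replicated part; its gradients get all-reduced automatically,
    # while the per-rank layer Li is left out of the synchronization.
    model.shared = DDP(model.shared, device_ids=[rank])

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x = torch.randn(32, hidden, device="cuda")       # same batch on every rank
    target = torch.randn(32, hidden, device="cuda")  # placeholder target

    loss = nn.functional.mse_loss(model(x), target)  # placeholder per-rank loss
    loss.backward()                                  # DDP syncs only model.shared's grads
    opt.step()
```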