Combining DDP with model parallelism in a specific way


I have 4 GPU devices. Let my network have 11 layers (L0 through L10), split into {L0, L1, L2, L3} and {L4, L5, …, L10}. I want to do a hybrid of DDP and model parallelism.

  1. Each Li, for i = 0, 1, 2, 3, should go to GPU:i.
  2. A copy of {L4, …, L10} has to be on each GPU. Hence the network on GPU:i should be {Li, L4, …, L10}.
  3. In each pass, we want to gather the forward-pass outputs from all GPUs, compute a loss, backpropagate, and synchronize gradients. The data passed through each GPU is the same, so this is not “data parallel” in the usual sense; it is only that part of the model is replicated identically on all devices.

Issue: In the standard distributed documentation and model parallelism demos, the network is always sliced sequentially across the devices, for example {L0, L1} on GPU:0, {L2, L3} on GPU:1, and so on. In this case, part of the model is different on each device and part of it is the same.

Is there a straightforward way to implement this that I’m missing? And what difficulties might I run into while doing so?

I am also facing a similar issue. @smth @ptrblck any help with this?

Have you tried to nest DDP inside each pipeline stage?
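
One possible reading of this for the layout in the question is to wrap only the replicated part in DDP and leave the per-GPU layer as a plain module. The sketch below is untested at scale and makes several assumptions: `HybridNet`, `train_one_step`, the layer sizes, the `gloo` backend, and the file-based rendezvous are all hypothetical stand-ins, and it computes a per-rank loss rather than gathering outputs across ranks. It runs as a single CPU process so it can be tried without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class HybridNet(nn.Module):
    """A rank-private first layer feeding a shared, DDP-wrapped trunk."""

    def __init__(self, in_dim=8, hidden=16, out_dim=4):
        super().__init__()
        # Stand-in for Li: owned by this rank only, never synchronized.
        self.private = nn.Linear(in_dim, hidden)
        # Stand-in for {L4, ..., L10}: replicated on every rank.
        trunk = nn.Sequential(
            nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        # Only the trunk is wrapped, so only its gradients are all-reduced.
        # DDP also broadcasts the trunk's initial weights from rank 0.
        self.shared = DDP(trunk)

    def forward(self, x):
        return self.shared(self.private(x))


def train_one_step(rank, world_size, init_file):
    """Run one optimizer step on this rank and return the local loss."""
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=rank, world_size=world_size
    )
    try:
        torch.manual_seed(1234 + rank)  # give each rank a different private layer
        model = HybridNet()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        torch.manual_seed(0)  # every rank sees the same batch, as in the question
        x, y = torch.randn(32, 8), torch.randn(32, 4)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()  # trunk gradients are averaged across ranks here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke run; real use would launch one process per GPU.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    print(train_one_step(rank=0, world_size=1, init_file=init_file))
```

On real hardware you would presumably switch to the `nccl` backend, move the model to `cuda:rank`, and pass `device_ids=[rank]` to DDP. Gathering the outputs from all ranks into one loss would additionally need a collective that propagates gradients, which this sketch does not attempt.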

One issue you might run into is that you need to use DDP with gradient accumulation and control the synchronization explicitly. You can do that with DDP's no_sync context manager.
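
For reference, a minimal sketch of gradient accumulation with `no_sync`, assuming equal-sized micro-batches and a mean-reduced loss. The helper names `accumulate_and_step` and `demo`, the toy `nn.Linear` model, the `gloo` backend, and the file-based rendezvous are hypothetical choices for illustration; the demo runs as a single CPU process.

```python
import contextlib
import copy
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def accumulate_and_step(ddp_model, optimizer, micro_batches, loss_fn):
    """Accumulate gradients locally and all-reduce only on the last micro-batch."""
    optimizer.zero_grad()
    n = len(micro_batches)
    for i, (x, y) in enumerate(micro_batches):
        # no_sync() suppresses DDP's all-reduce; the forward pass must be
        # inside the context as well, not just backward().
        ctx = ddp_model.no_sync() if i < n - 1 else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / n  # scale so the gradients average
            loss.backward()
    optimizer.step()


def demo():
    """Single-process check that accumulation matches one full-batch step."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        torch.manual_seed(0)
        layer = nn.Linear(4, 2)
        reference = copy.deepcopy(layer)  # same weights, trained without DDP
        x, y = torch.randn(8, 4), torch.randn(8, 2)

        ddp_model = DDP(layer)
        opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        batches = list(zip(x.chunk(2), y.chunk(2)))
        accumulate_and_step(ddp_model, opt, batches, nn.functional.mse_loss)

        ref_opt = torch.optim.SGD(reference.parameters(), lr=0.1)
        nn.functional.mse_loss(reference(x), y).backward()
        ref_opt.step()
        return torch.allclose(layer.weight, reference.weight, atol=1e-6)
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo())
```

With more than one process, the only all-reduce happens on the final micro-batch, so the communication cost is paid once per optimizer step instead of once per backward pass.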

cc @jamesr66a as he’s the expert in pipeline parallelism (PP).