I am not sure about DistributedParallel but in DataParallel each GPU gets a copy of the model, so, the parallelization is done via splitting the minibatches, not the layers/weights.
Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU.
