How do PyTorch's data-parallel and distributed training methods work?

I am not sure about DistributedDataParallel, but in DataParallel each GPU gets a copy of the model, so the parallelization is done by splitting the minibatches, not the layers/weights.

Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU.
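To make the minibatch-splitting idea concrete, here is a minimal code sketch (not from the original post) using a hypothetical toy model. It assumes 4 visible GPUs with GPU:0 as the default/output device, matching the setup above; the layer sizes and batch size are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy model purely for illustration.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

if torch.cuda.device_count() > 1:
    # Replicates the model onto each listed GPU; a minibatch fed to the
    # wrapper is split along dim 0 and each replica processes its chunk.
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3], output_device=0)

model = model.to("cuda:0")  # GPU:0 acts as the default/output device

# Hypothetical batch of 256 samples: DataParallel splits it into four
# chunks of 64, one per GPU, then gathers the outputs back on GPU:0.
x = torch.randn(256, 128, device="cuda:0")
out = model(x)      # forward pass runs on all replicas in parallel
print(out.shape)    # torch.Size([256, 10]), gathered on cuda:0
```

Note that the weights themselves are not sharded; every GPU holds a full copy of the model, and gradients from each chunk are accumulated back on the default GPU during the backward pass.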
