How pytorch's parallel method and distributed method works?

rasbt · November 24, 2018, 5:03am

I am not sure about DistributedParallel but in DataParallel each GPU gets a copy of the model, so, the parallelization is done via splitting the minibatches, not the layers/weights.

Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU.