torch.nn.parallel.data_parallel for distributed training: backward-pass model update

When using torch.nn.parallel.data_parallel for distributed training, the model is copied onto multiple GPUs, and the copies can complete a forward pass without affecting each other. However, how do the copies interact during the backward pass? How are the model weights updated on each GPU?

Reading the documentation [1], I see the explanation:
“gradients from each replica are summed into the original module”

but I’m not sure how to interpret this statement. My best guess is that each GPU’s model is updated with the sum of the gradients from all GPUs’ models, which would imply that there is locking across GPUs, so each GPU starts training on a new mini-batch only after all of them have finished processing their current mini-batch.



Each GPU computes the gradients for its part of the batch. These gradients are then accumulated on the “main” model, where the weight update is done. Finally, the “main” model shares its weights with all the other GPUs, so that every replica has the same weights.
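The flow above can be sketched in plain Python. This is a toy simulation, not the actual PyTorch implementation: the names (`chunk_gradients`, `data_parallel_step`) and the toy loss are made up for illustration.

```python
def chunk_gradients(weights, batch_chunk):
    # Toy stand-in for one replica's backward pass: gradient of
    # 0.5 * (w * x)**2 with respect to w, summed over the chunk.
    return [sum((w * x) * x for x in batch_chunk) for w in weights]

def data_parallel_step(main_weights, batch_chunks, lr=0.1):
    # 1. Each "GPU" computes gradients for its chunk of the batch.
    per_replica_grads = [chunk_gradients(main_weights, c) for c in batch_chunks]
    # 2. The gradients are summed onto the "main" model.
    summed = [sum(g[i] for g in per_replica_grads)
              for i in range(len(main_weights))]
    # 3. The weight update happens once, on the main model only.
    main_weights = [w - lr * g for w, g in zip(main_weights, summed)]
    # 4. The updated weights are shared back, so all replicas match.
    replicas = [list(main_weights) for _ in batch_chunks]
    return main_weights, replicas

weights, replicas = data_parallel_step([1.0], [[1.0, 2.0], [3.0]])
assert all(r == weights for r in replicas)  # every replica sees the same weights
```

Note that the replicas never update themselves; they only produce gradients, and the single update on the main model is what keeps everything consistent.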


Hi Alban,

Thanks for clarifying. Does this mean that this parallelization utilizes locking, ensuring that each GPU model updates its weights from the “main” model before moving on to the next mini-batch?

Yes, the locking is built in, and the weights will be properly updated before they are used.


Hey, I am facing some issues with data parallelism. I am training on 4 V100s with a batch size of 1. The time of the forward pass seems to scale, but the backward pass takes 4× longer compared to a single V100, so there is no significant speedup when using 4 GPUs. I guess the backward pass is taking place on a single GPU. I am using nn.parallel.DataParallel; is there any solution to this problem? I can share more details if you want.

Hey @sanchit2843

When using batch_size == 1, DataParallel won’t be able to parallelize the computation, as the input cannot be chunked by the scatter linked below.
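A rough sketch of why a batch of 1 cannot be spread over 4 GPUs: scatter splits the input along dimension 0 into at most `num_devices` contiguous chunks, similar to what `torch.chunk` does. This is a pure-Python illustration with a hypothetical `scatter_batch` helper, not the actual scatter implementation.

```python
def scatter_batch(batch, num_devices):
    # Split into at most num_devices contiguous chunks along dim 0,
    # mimicking torch.chunk's ceil-sized chunking.
    n = len(batch)
    chunk_size = max(1, -(-n // num_devices))  # ceiling division, at least 1
    return [batch[i:i + chunk_size] for i in range(0, n, chunk_size)]

print(scatter_batch([0], 4))           # batch_size == 1: a single chunk,
                                       # so only one GPU gets any work
print(scatter_batch([0, 1, 2, 3], 4))  # batch_size == 4: one sample per GPU
```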

but the time of backward pass is taking 4* times in comparison to 1 V100

Can you share a minimum repro of this? In general, a 4X slowdown is possible due to Python GIL contention (which you can avoid by using DistributedDataParallel). But I feel this should not apply here, as each batch contains only one sample, so the replicated models are not really involved in the forward and backward passes.
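For contrast, here is a conceptual sketch (plain Python, hypothetical names) of what DistributedDataParallel does differently: one process per GPU, with an all-reduce that averages gradients so every worker applies the same update locally. There is no single "main" copy to funnel gradients through, and no Python GIL contention across replicas.

```python
def all_reduce_mean(per_worker_grads):
    # Average gradients element-wise across workers, like an all-reduce.
    n = len(per_worker_grads)
    return [sum(g[i] for g in per_worker_grads) / n
            for i in range(len(per_worker_grads[0]))]

def ddp_step(worker_weights, per_worker_grads, lr=0.1):
    avg = all_reduce_mean(per_worker_grads)
    # Every worker applies the identical averaged update, so the weights
    # stay in sync without broadcasting from a main model.
    return [[w - lr * g for w, g in zip(ws, avg)] for ws in worker_weights]

workers = ddp_step([[1.0], [1.0]], [[2.0], [4.0]])
assert workers[0] == workers[1]  # all workers remain identical
```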

Hey, thanks for the answer. Actually, I think it is because all 4 backpropagations are running on one GPU only. Is there any solution that lets me run it with a batch size of 1? And what does "repro" mean? Thanks in advance.

I tried with a batch size of 2 as well. The epoch time with a single GPU is 2.5 hours, and with four GPUs it is 3.3 hours. Any solution would be helpful.