How do gradients get gathered on the main GPU when using nn.DataParallel?

I have a set of pictures for the working principle of nn.DataParallel:

I have a question on the last picture. How do the gradients actually get gathered on the main GPU? Say there's a Conv2d whose gradient on GPU:0 is w0 (with shape (256, 128, 3, 3)) and whose gradient on GPU:1 is w1 (with shape (256, 128, 3, 3)). Is the final gradient for this Conv2d on GPU:0 then w0 + w1? And does the '+' here mean element-wise addition?


We use autograd to automatically accumulate the gradients into the original Parameter, which lives on the main GPU.
Yes, the gradients are added element-wise. It is the same as if you were re-using a given Tensor in multiple expressions when computing your loss.
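A minimal sketch of that accumulation behavior: when the same tensor feeds two expressions, autograd sums the two gradient contributions element-wise into a single `.grad`, just like the per-replica gradients w0 and w1 being summed for one Parameter (toy shapes and values here, chosen only for illustration):

```python
import torch

# One shared parameter, used in two branches. This mimics the two
# per-GPU gradient contributions (w0 and w1) for the same Conv2d weight.
w = torch.ones(2, 2, requires_grad=True)

loss0 = (2 * w).sum()  # "replica 0" branch: d(loss0)/dw = 2 everywhere
loss1 = (3 * w).sum()  # "replica 1" branch: d(loss1)/dw = 3 everywhere

(loss0 + loss1).backward()

# autograd accumulated both contributions element-wise: 2 + 3 = 5
print(w.grad)  # tensor filled with 5.0
```

So the '+' in w0 + w1 really is an element-wise add over tensors of identical shape; the result lands in the Parameter's `.grad` on the main GPU.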
