Will DDP reduce the gradients of only parameters or of all tensors?


When we train a model, the backward pass computes gradients not only for the model parameters but also for intermediate tensors. For example, consider a linear layer:

y = x * w

where w is the model parameter and x is the input tensor. The backward pass computes not only the gradient of w but also the gradient of x.
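For the scalar case this can be sketched without any framework (a plain-Python illustration, not PyTorch's actual autograd code): with y = x * w, we have dy/dw = x and dy/dx = w, so a backward pass from an upstream gradient g produces a gradient for both inputs.

```python
# Plain-Python sketch: for y = x * w, the backward pass propagates an
# upstream gradient g to BOTH the parameter and the input.
def linear_backward(x, w, g):
    grad_w = g * x  # gradient w.r.t. the parameter w (dy/dw = x)
    grad_x = g * w  # gradient w.r.t. the input x (dy/dx = w)
    return grad_w, grad_x

grad_w, grad_x = linear_backward(x=3.0, w=2.0, g=1.0)
print(grad_w, grad_x)  # 3.0 2.0
```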

What happens when I train with multiple GPUs? I learned from the docs that the gradient of w is averaged across GPUs. But I don't know what happens to the gradient of x. Will DistributedDataParallel also average the gradient of x at each layer, or is only the gradient of w averaged after the backward pass of each iteration finishes?

Only the parameter gradients should be averaged; the data gradient (in x) depends on the input data.
Since you are training on a different batch on each GPU, you wouldn't need to synchronize the data gradient unless I'm missing something in your use case.
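To make this concrete, here is a hedged plain-Python sketch of the behavior (a hypothetical two-worker setup simulating the all-reduce, not DistributedDataParallel's real implementation): the parameter gradient is averaged across workers, while each worker keeps its own input gradient.

```python
# Sketch of DDP-style gradient handling with two simulated "GPUs":
# only the PARAMETER gradient is all-reduced (averaged); the input
# gradient stays local to each worker.
def local_backward(x, w, g=1.0):
    # y = x * w  ->  dy/dw = x, dy/dx = w
    return g * x, g * w  # (grad_w, grad_x)

w = 2.0               # replicated parameter, identical on both workers
batches = [3.0, 5.0]  # a different input batch per worker

local = [local_backward(x, w) for x in batches]
grad_w_avg = sum(gw for gw, _ in local) / len(local)  # "all-reduce" average
grad_x_local = [gx for _, gx in local]                # never synchronized

print(grad_w_avg)    # 4.0 -- same averaged value on every worker
print(grad_x_local)  # [2.0, 2.0] -- each worker's own input gradient
```

Because every worker applies the same averaged grad_w, the replicated parameters stay in sync across iterations, while grad_x is only ever needed locally to continue backpropagation through that worker's own batch.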

Thanks, I figured it out now.