Hi,
When we train a model, the backward pass computes gradients not only for the model parameters but also for the intermediate tensors. For example, suppose there is a linear layer:
y = x * w
where w is the model parameter and x is the input tensor. In the backward pass, we compute not only the gradient of w but also the gradient of x.
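To make it concrete, here is a minimal single-GPU sketch of what I mean (the shapes are just for illustration):

```python
import torch

# A linear layer y = x * w: backward produces gradients for both
# the parameter w and the input tensor x.
w = torch.randn(4, 3, requires_grad=True)   # model parameter
x = torch.randn(2, 4, requires_grad=True)   # input tensor (grad tracked for illustration)

y = x @ w           # forward: y = x * w
loss = y.sum()
loss.backward()

print(w.grad.shape)  # gradient of w -> torch.Size([4, 3])
print(x.grad.shape)  # gradient of x -> torch.Size([2, 4])
```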
What if I train with multiple GPUs? I learned from the docs that the gradient of w from each GPU will be averaged, but I do not know what happens to the gradient of x. Will the gradient of x also be averaged by DistributedDataParallel at each layer, or does only the gradient of w get averaged after the backward pass of each iteration finishes?
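For reference, this is roughly the setup I have in mind (a hypothetical run function; it assumes the process group is already initialized, e.g. via torchrun):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def run(local_rank: int):
    # Each rank holds a replica of the same linear layer, wrapped in DDP.
    model = torch.nn.Linear(4, 3).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Input tensor with grad tracking enabled, different data on each rank.
    x = torch.randn(2, 4, device=local_rank, requires_grad=True)

    loss = ddp_model(x).sum()
    loss.backward()

    # DDP averages the parameter gradient across ranks:
    print(ddp_model.module.weight.grad)
    # My question: is x.grad also averaged across ranks, or is it purely local?
    print(x.grad)
```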