Hi,

When we train a model, the backward pass computes gradients not only of the model parameters but also of intermediate tensors. For example, consider a linear layer:

```
y = x @ w
```

where `w` is the model parameter and `x` is the input tensor. In the backward pass we compute not only the gradient of `w`, but also the gradient of `x` (which is needed to propagate gradients to earlier layers).
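
For example, a toy single-GPU snippet I put together to illustrate this (assuming the linear layer is a plain matrix multiply, and setting `requires_grad=True` on `x` just so autograd stores its gradient):

```
import torch

# toy "linear layer": both the parameter w and the input x track gradients
x = torch.randn(4, 3, requires_grad=True)   # input tensor
w = torch.randn(3, 2, requires_grad=True)   # model parameter
y = x @ w                                   # shape (4, 2)

# backward pass: autograd fills in .grad for every leaf that requires grad
y.sum().backward()

print(w.grad.shape)  # gradient w.r.t. the parameter, (3, 2)
print(x.grad.shape)  # gradient w.r.t. the input, (4, 3)
```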

What if I train with multiple GPUs? I learned from the docs that the gradient of `w` on each GPU will be averaged, but I do not know what happens to the gradient of `x`. Will the gradient of `x` also be averaged by `DistributedDataParallel` at each layer, or is only the gradient of `w` averaged after the backward pass of each iteration finishes?
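
To make the question concrete, here is a minimal sketch of the setup I mean (the `gloo` backend, two CPU processes, and the `run` function are just placeholders for illustration):

```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(3, 2, bias=False)
    ddp_model = DDP(model)

    x = torch.randn(4, 3, requires_grad=True)  # different data on each rank
    y = ddp_model(x)
    y.sum().backward()

    # model.weight.grad: the docs say DDP averages this across ranks.
    # x.grad: is this also averaged, or left as the purely local gradient?
    print(rank, model.weight.grad)
    print(rank, x.grad)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```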