Why are gradient aggregation methods different in DataParallel and DistributedDataParallel?

In the documentation of DataParallel, it says:

During the backwards pass, gradients from each replica are **summed** into the original module.

In the documentation of DistributedDataParallel, it says:

During the backwards pass, gradients from each node are **averaged**.
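To make the averaging concrete, here is a minimal two-process sketch on CPU with the gloo backend (the master address, port, and per-rank sample values are arbitrary choices for illustration, not anything prescribed by PyTorch). The local gradients are 50.0 on rank 0 and 2.0 on rank 1; after DDP's all-reduce, both ranks see the average, 26.0.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # y = w * x with w initialized to 1.0, replicated on every rank.
    model = nn.Linear(1, 1, bias=False)
    model.weight.data.fill_(1.0)
    ddp_model = DDP(model)

    # One sample per rank: 5.0 on rank 0, -1.0 on rank 1.
    x = torch.tensor([[5.0 if rank == 0 else -1.0]])
    loss = torch.sum(ddp_model(x) ** 2)
    loss.backward()

    # Local grads are 50.0 and 2.0; DDP all-reduces and divides by world_size,
    # so both ranks end up with (50 + 2) / 2 = 26.0.
    print(f"rank {rank}: grad = {model.weight.grad.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```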

I think this difference is behind the recommendation quoted here:

Important: The default learning rate is for 8 GPUs. If you use less or more than 8 GPUs, you need to set the learning rate proportional to the GPU num. E.g., modify lr to 0.01 for 4 GPUs or 0.04 for 16 GPUs.
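As a quick illustration of that linear scaling rule (the helper name `scale_lr` and the 0.02 base value, inferred from the 0.01/0.04 numbers in the quote, are my own choices, not part of any library):

```python
def scale_lr(num_gpus, base_lr=0.02, base_gpus=8):
    """Keep the learning rate proportional to the number of GPUs."""
    return base_lr * num_gpus / base_gpus


print(scale_lr(4))   # 0.01
print(scale_lr(16))  # 0.04
```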

I think that in DataParallel mode only one model is created, while in DistributedDataParallel each node (process) holds its own independent model replica. DataParallel simply scatters the input across the GPUs, and after the forward pass the outputs are gathered so the loss is computed on the master device. For a more concrete discussion, you can refer to this thread: Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
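A rough sketch of the DataParallel side (assuming at least two visible CUDA devices; with a single device it degenerates to an ordinary forward pass): the batch is scattered across the GPUs, the outputs are gathered back to GPU 0, and the per-replica gradients are summed into the original module, so the result matches a plain single-device run over the whole batch.

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=1, out_features=1, bias=False).cuda()
model.weight.data.fill_(1.0)

dp_model = nn.DataParallel(model)  # replicates the module onto all visible GPUs

x = torch.tensor([[5.0], [-1.0]]).cuda()  # scattered: one sample per GPU
loss = torch.sum(dp_model(x) ** 2)        # outputs gathered back on GPU 0
loss.backward()                           # replica gradients summed into `model`

print(model.weight.grad)  # tensor([[52.]]) -- same as the single-device example below
```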

I’m astonished that gradients are summed over the batch dimension:

import torch
import torch.nn as nn

# A batch of two samples, shape (2, 1).
x = torch.tensor([5.0, -1.0], dtype=torch.float).cuda().view(-1, 1)

# A single linear layer y = w * x with no bias.
model = nn.Linear(in_features=1, out_features=1, bias=False).cuda()

# Initialize the weight to exactly 1.0.
model.weight.data.zero_()
model.weight.data.add_(1.0)

y = model(x)

label = torch.zeros(2, 1, dtype=torch.float).cuda()

# Sum-of-squared-errors loss: the reduction over the batch is a sum.
loss = torch.sum((y - label)**2)

loss.backward()

print(model.weight.grad)

The code above prints tensor([[52.]]), i.e., the gradient is 52.0.

This astonishes me because the mini-batch gradient is an approximation of the true gradient and is supposed to be unbiased. The natural choice would be to average over the batch dimension, yet here PyTorch sums the gradient over the batch dimension.
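That said, the 52.0 can be traced to the sum reduction in the loss itself rather than to any special batch handling in autograd: d loss / dw = sum_i 2·w·x_i² = 2·(25 + 1) = 52. A minimal CPU variant of the same toy model, but with a mean reduction, is sketched below; it yields the averaged value 26.0 instead.

```python
import torch
import torch.nn as nn

# Same toy setup as above, but on CPU and with a mean reduction in the loss.
x = torch.tensor([[5.0], [-1.0]])

model = nn.Linear(in_features=1, out_features=1, bias=False)
model.weight.data.fill_(1.0)

label = torch.zeros(2, 1)

# Mean-of-squared-errors: autograd now averages the per-sample gradients.
loss = torch.mean((model(x) - label) ** 2)
loss.backward()

print(model.weight.grad)  # tensor([[26.]]) == 52 / 2
```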