Thanks for the great code and the interesting question!
Based on your output (and my runs) it seems that the bias of the preceding linear layer is not updated.
While the check claims so, inspecting the gradients of the bias parameter shows that they are indeed vanishingly small:
```
model.fc1.bias.grad
> tensor([ 5.9605e-08, -1.7881e-07])
```
and thus the bias value is not changed.
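In case it helps, here is a minimal sketch to reproduce the observation (the shapes and the dummy MSE target are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# toy setup: a linear layer directly followed by batchnorm
model = nn.Sequential(
    nn.Linear(10, 2),   # this bias is the parameter in question
    nn.BatchNorm1d(2),
)

x = torch.randn(8, 10)
target = torch.randn(8, 2)
loss = F.mse_loss(model(x), target)
loss.backward()

# the weight receives a "real" gradient, while the bias gradient
# is just numerical noise close to zero
print(model[0].weight.grad.abs().max())
print(model[0].bias.grad)
```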
Now, this might be surprising, but let’s think about the first operation of a batchnorm layer:
```
out = (x - mean) / stddev * weight + bias
```
As you can see, the mean of the incoming batch is subtracted. If the previous layer has added a bias to the activations, it is directly subtracted again.
Wouldn't this also mean that this parameter has (almost) zero influence on the loss calculation (up to numerical precision)?
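You can also verify this cancellation directly: shifting the bias by an arbitrary constant leaves the batchnorm output (and thus any loss computed from it) unchanged up to floating point noise. A quick sketch, again with made-up shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(10, 2)
bn = nn.BatchNorm1d(2)

x = torch.randn(8, 10)
out_ref = bn(lin(x))

# shift the bias by an arbitrary constant; the per-feature mean
# subtraction inside batchnorm removes this shift again
with torch.no_grad():
    lin.bias += 10.0
out_shifted = bn(lin(x))

# only floating point noise remains
print((out_ref - out_shifted).abs().max())
```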
The performance guide also mentions:
> If a `nn.Conv2d` layer is directly followed by a `nn.BatchNorm2d` layer, then the bias in the convolution is not needed, instead use `nn.Conv2d(..., bias=False, ....)`. Bias is not needed because in the first step `BatchNorm` subtracts the mean, which effectively cancels out the effect of bias. This is also applicable to 1d and 3d convolutions as long as `BatchNorm` (or other normalization layer) normalizes on the same dimension as convolution's bias.
However, this mainly targets performance, as it avoids launching an unnecessary `add` kernel.
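In code this corresponds to the common pattern below (the channel sizes are arbitrary):

```python
import torch.nn as nn

# the conv bias is dropped, since the following batchnorm would cancel
# it anyway; batchnorm's own affine bias (beta) provides the learnable shift
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
```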
CC @tom and @KFrank to correct me here in case my explanation is wrong. I’m also sure both can add interesting explanations from a more mathematical point of view.