Thanks for the great code and the interesting question!
Based on your output (and my runs) it seems that the bias of the preceding linear layer is not updated.
While the check claims so, inspecting the gradients of the bias parameter shows that they are indeed vanishingly small:
```
model.fc1.bias.grad
> tensor([ 5.9605e-08, -1.7881e-07])
```
and thus the bias value is not changed.
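In case it helps, here is a minimal sketch to reproduce the observation (the shapes and the dummy MSE target are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# toy setup: a linear layer directly followed by batchnorm
model = nn.Sequential(
    nn.Linear(10, 2),   # this bias is the parameter in question
    nn.BatchNorm1d(2),
)

x = torch.randn(8, 10)
target = torch.randn(8, 2)
loss = F.mse_loss(model(x), target)
loss.backward()

# the weight receives a "real" gradient, while the bias gradient
# is just numerical noise close to zero
print(model[0].weight.grad.abs().max())
print(model[0].bias.grad)
```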
Now, this might be surprising, but let’s think about the first operation of a batchnorm layer:
```
out = (x - mean) / stddev * weight + bias
```
As you can see, the mean of the incoming batch is subtracted. If the previous layer has added a bias to the activations, it is directly subtracted again.
Wouldn't this also mean that this parameter has (almost) zero influence on the loss calculation (up to numerical precision)?
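You can also verify this cancellation directly: shifting the bias by an arbitrary constant leaves the batchnorm output (and thus any loss computed from it) unchanged up to floating point noise. A quick sketch, again with made-up shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(10, 2)
bn = nn.BatchNorm1d(2)

x = torch.randn(8, 10)
out_ref = bn(lin(x))

# shift the bias by an arbitrary constant; the per-feature mean
# subtraction inside batchnorm removes this shift again
with torch.no_grad():
    lin.bias += 10.0
out_shifted = bn(lin(x))

# only floating point noise remains
print((out_ref - out_shifted).abs().max())
```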
The performance guide also mentions:
> If a `nn.Conv2d` layer is directly followed by a `nn.BatchNorm2d` layer, then the bias in the convolution is not needed, instead use `nn.Conv2d(..., bias=False, ....)`. Bias is not needed because in the first step `BatchNorm` subtracts the mean, which effectively cancels out the effect of bias. This is also applicable to 1d and 3d convolutions as long as `BatchNorm` (or other normalization layer) normalizes on the same dimension as convolution's bias.
However, this mainly targets performance, as it avoids launching an unnecessary `add` kernel.
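In code this corresponds to the common pattern below (the channel sizes are arbitrary):

```python
import torch.nn as nn

# the conv bias is dropped, since the following batchnorm would cancel
# it anyway; batchnorm's own affine bias (beta) provides the learnable shift
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
```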
CC @tom and @KFrank to correct me here in case my explanation is wrong. I’m also sure both can add interesting explanations from a more mathematical point of view.