The paper below [1] reports experimental results indicating that “the effect of regularization was concentrated on the BN layer,” so we should be cautious about turning off weight decay for BN layers by default.
“As evidence, we found that almost all of the regularization effect of weight decay was due to applying it to layers with BN (for which weight decay is meaningless).”
The mechanism of weight decay does not seem to be fully understood even within the research community. At least until there is a clear empirical and theoretical basis, the change above should be put on hold.
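
For context, the "meaningless" remark in the quoted passage refers to the fact that a layer followed by BN produces the same output when its weights are rescaled, so shrinking the weight norm does not by itself change the function being computed. The sketch below is a toy check of that invariance in plain Python. It is not the paper's experiment: the helper names (`batch_norm`, `layer_output`) and the data are made up for illustration, BN is reduced to plain normalization without learned scale or shift, and `eps` is set to 0 so the invariance is exact up to floating-point error (with the usual small positive `eps` it holds only approximately).

```python
import math

def batch_norm(values, eps=0.0):
    """Normalize a batch of scalars to zero mean and unit variance.
    Simplified BN: no learned scale/shift, eps=0 by assumption."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_output(weights, batch):
    """Linear layer (one dot product per sample) followed by batch_norm."""
    pre_activations = [
        sum(w * x for w, x in zip(weights, sample)) for sample in batch
    ]
    return batch_norm(pre_activations)

# Hypothetical toy data: 4 samples with 2 features each.
batch = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
weights = [0.8, -0.3]

# Mimic the shrinking effect of weight decay by scaling the weights down.
shrunk = [0.1 * w for w in weights]

original_out = layer_output(weights, batch)
shrunk_out = layer_output(shrunk, batch)

# Largest difference between the two normalized outputs.
max_diff = max(abs(a - b) for a, b in zip(original_out, shrunk_out))

# Squared L2 norms: the penalty weight decay acts on.
l2_original = sum(w * w for w in weights)
l2_shrunk = sum(w * w for w in shrunk)

print("max output difference:", max_diff)
print("squared L2 norm before/after:", l2_original, l2_shrunk)
```

The L2 penalty drops sharply while the normalized output stays the same, which is why the direct penalty carries no meaning for such a layer. The paper nevertheless measures a regularization effect there and attributes it to a change in the effective learning rate, which is the reason for the caution above.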
[1] [1810.12281] Three Mechanisms of Weight Decay Regularization