I have what I hope is a simple question: when the mean and variance are computed in the batchnorm layer, are gradients propagated through the normalization? I.e., are the mu and var in y = (x - mu) / sqrt(var + eps) plain numbers or gradient-tracked tensors?
I’m asking because I want to implement a modified version of batchnorm that uses the variance of the dimension before dropout is applied, and I need to know whether I have to call .detach() on mu and var.
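Something along these lines is what I mean (just a sketch, the module name and the 2D (batch, features) layout are my own assumptions, not an existing PyTorch layer):

```python
import torch
import torch.nn as nn

class PreDropoutBatchNorm(nn.Module):
    """Hypothetical custom batchnorm: normalizes with statistics computed
    from the current batch in the forward pass."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Should mu and var stay gradient-tracked, or be detached here?
        mu = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        y = (x - mu) / torch.sqrt(var + self.eps)
        return y * self.weight + self.bias
```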
Does batch norm in training mode normalize to the current batch statistics, or to the current running stats? I assume the former, because I have a case where batch statistics are actually tied to a bag of instances, so they aren’t comparable across bags (which is why I’m using track_running_stats=False).
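A quick check of the kind I mean (assuming a plain nn.BatchNorm1d with track_running_stats=False and no affine parameters):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4, affine=False, track_running_stats=False)
bn.train()
x = torch.randn(32, 4) * 3 + 5          # stats far from (0, 1)
y = bn(x)
# Manual normalization with the *current batch* statistics.
manual = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)
print(torch.allclose(y, manual, atol=1e-6))  # True -> batch stats are used
```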
The gradients should not be detached. This is actually explained on the second page of the original batchnorm paper. Imagine the loss function “wants” to increase the value of a batchnormed activation because of a bias in the targets (i.e., independent of the input to the network). If you detach the mean, the gradients will push up the pre-normed activations across the whole batch, so the difference between each sample value and the batch mean stays constant, and the pre-normed activations drift without bound. They explain it better in the paper: “As the training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.”
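Here’s a tiny toy reproduction of that argument (my own sketch, not code from the paper): centering-only normalization, a squared loss against a biased target, and a bias b added to the activations. If the mean is kept in the graph, b gets (essentially) zero gradient; if the mean is detached, b gets a nonzero gradient even though the forward output doesn’t depend on b at all, which is exactly the “b grows while the loss stays fixed” drift.

```python
import torch

u = torch.randn(8)                       # fixed pre-bias activations
b = torch.zeros(1, requires_grad=True)   # bias that the loss "wants" to grow
target = torch.ones(8)                   # biased target

def loss_with(detach_mean):
    x = u + b
    mu = x.mean().detach() if detach_mean else x.mean()
    x_hat = x - mu                       # centering only, as in the paper's example
    return ((x_hat - target) ** 2).mean()

loss_with(detach_mean=False).backward()
print(b.grad)     # ~0: the gradient through the mean cancels the bias update
b.grad = None
loss_with(detach_mean=True).backward()
print(b.grad)     # nonzero: b would keep growing while the loss never changes
```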