I have what I hope is a simple question: when mu and var are calculated in a batchnorm layer, are gradients propagated through them? That is, are the mu and var in y = (x - mu) / sqrt(var + eps) plain numbers, or gradient-tracked tensors?
I’m asking because I want to implement a modified version of batchnorm that uses the variance of the dimension before dropout is applied, and I need to know whether I have to call .detach() on mu and var.
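For context, here is the kind of check I tried (a minimal sketch, not PyTorch's internal implementation): normalize a tensor with statistics computed from that same tensor and inspect whether the statistics stay in the autograd graph.

```python
import torch

eps = 1e-5
x = torch.randn(8, 4, requires_grad=True)

# Batch statistics computed from x itself; since x requires grad,
# these are gradient-tracked tensors, not plain numbers.
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
y = (x - mu) / torch.sqrt(var + eps)

print(mu.requires_grad, var.requires_grad)  # True True

# Detaching the statistics removes them from the graph, so backward
# treats them as constants:
y_const_stats = (x - mu.detach()) / torch.sqrt(var.detach() + eps)
```

So in my own code the statistics clearly carry gradients; my question is whether the built-in batchnorm behaves the same way.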
Also, does batch norm in training mode normalize with the current batch statistics, or with the running statistics? I assume the former, because I have a case where batch statistics are tied to a bag of instances, so they aren’t comparable across bags (which is why I’m using track_running_stats=False).
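Here is the sanity check I had in mind (a sketch, using affine=False so only the normalization itself is compared): in training mode the output should match manual normalization with the batch mean and biased batch variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# track_running_stats=False: the layer never accumulates running stats
# and always normalizes with the current batch statistics.
bn = nn.BatchNorm1d(4, affine=False, track_running_stats=False)
bn.train()

x = torch.randn(16, 4)
out = bn(x)

# Manual normalization with batch statistics (biased variance).
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-5))
```

If the two outputs agree, training mode really does use per-batch statistics, which is what I need for the per-bag setup.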