Does group norm maintain a running average of mean and variance?

Looking at the code here: https://pytorch.org/docs/stable/_modules/torch/nn/modules/normalization.html

Neither group norm nor layer norm seems to maintain running averages. The documentation, however, suggests they might: https://pytorch.org/docs/stable/nn.html?highlight=group%20norm#torch.nn.GroupNorm

“this layer uses statistics computed from input data in both training and evaluation modes”
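A quick sanity check against the modules themselves (a minimal sketch, assuming a recent PyTorch) shows that GroupNorm and LayerNorm register no running-stat buffers at all, and that switching to eval mode changes nothing:

```python
import torch
import torch.nn as nn

# BatchNorm registers running-statistics buffers:
bn = nn.BatchNorm2d(8)
print([name for name, _ in bn.named_buffers()])
# -> ['running_mean', 'running_var', 'num_batches_tracked']

# GroupNorm and LayerNorm register none:
gn = nn.GroupNorm(2, 8)   # 2 groups over 8 channels
ln = nn.LayerNorm(8)
print([n for n, _ in gn.named_buffers()], [n for n, _ in ln.named_buffers()])
# -> [] []

# With no buffers, train/eval mode makes no difference to the output:
x = torch.randn(4, 8, 5, 5)
out_train = gn.train()(x)
out_eval = gn.eval()(x)
assert torch.allclose(out_train, out_eval)
```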

Whether or not they are supposed to, I don’t know. I don’t see running averages in the TensorFlow version of group norm either: https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/contrib/layers/python/layers/normalization.py (group_norm)

Or layer norm for that matter:

https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/contrib/layers/python/layers/layers.py (layer_norm)

As both compute a separate mean and std for every sample along the batch dim (i.e. the mean’s shape is (N, 1) in layer norm), tracking a running average doesn’t make sense. Who is to say a similar sample will sit at that exact position in your validation batch?
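To make the shapes concrete, here is a small sketch of the per-sample statistics LayerNorm computes for a 2D input (the eps value 1e-5 is PyTorch’s default):

```python
import torch

x = torch.randn(4, 16)                         # N=4 samples, D=16 features
mean = x.mean(dim=-1, keepdim=True)            # shape (4, 1): one mean per sample
var = x.var(dim=-1, unbiased=False, keepdim=True)
print(mean.shape)                              # torch.Size([4, 1])

# Row i's statistic describes sample i only; carrying it over as a running
# average would tie a value to a *position* in the batch, which is arbitrary.
y = (x - mean) / torch.sqrt(var + 1e-5)        # LayerNorm's output, pre-affine
```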

I also found the doc confusing. How can I temporarily freeze the running statistics to use other data? Thank you!

“As both compute a separate mean and std for every sample along the batch dim (i.e. the mean’s shape is (N, 1) in layer norm), tracking a running average doesn’t make sense. Who is to say a similar sample will sit at that exact position in your validation batch?”

I don’t think this is a good explanation. Sure, for LayerNorm the computed statistics are of shape (N,), but they could still be used as separate values for updating running statistics of shape (1,). InstanceNorm, for example, computes per-batch and per-channel statistics of shape (N, C). However, these aren’t used directly, since, as you mentioned, that would NOT be invariant to batch permutations. Instead, statistics of shape (C,) are maintained; see _NormBase, used by both BatchNorm and InstanceNorm.
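For reference, a small sketch of that last point (assuming a recent PyTorch; note that InstanceNorm only tracks running stats when asked):

```python
import torch
import torch.nn as nn

# track_running_stats defaults to False for InstanceNorm; opting in gives
# buffers of shape (C,), not (N, C):
inorm = nn.InstanceNorm1d(6, track_running_stats=True)
x = torch.randn(4, 6, 10)          # (N, C, L)
inorm(x)                           # train mode: updates the running buffers
print(inorm.running_mean.shape)    # torch.Size([6])

# The per-sample (N, C) statistics are averaged over N before the update,
# so the result is invariant to permuting samples within the batch.
```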

A similar argument holds for GroupNorm. In fact, the authors explicitly discourage using running statistics; see Section 2 of the paper:

“The pre-computed statistics may also change when the target data distribution changes [45]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as aforementioned, reducing the batch size can have dramatic impact on the estimated batch statistics.”

Similar remarks can be found in Section 1 of the LayerNorm paper. Perhaps that is the reason running statistics are not computed for these two methods.