According to my understanding, layer normalization normalizes across the features (elements) of a single example, so all the elements in that example should
(1) use the same mean and variance, computed over the example's own elements;
(2) be scaled and shifted by the same parameters gamma and beta.
In other words, different elements of one example should share the same normalization parameters, not have a separate parameter per element (i.e. not per-element).
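To make my understanding concrete, here is a minimal NumPy sketch of what I think layer normalization does (the function and variable names are my own, not from any library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # mean/var are computed over the features of EACH example,
    # so every element of one example shares the same statistics
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # but gamma/beta here have one entry per feature, which is the
    # "per-element scale and bias" the note seems to describe
    return gamma * x_hat + beta

x = np.random.randn(4, 8)  # batch of 4 examples, 8 features each
gamma = np.ones(8)         # per-element scale
beta = np.zeros(8)         # per-element bias
out = layer_norm(x, gamma, beta)
print(out.shape)  # (4, 8)
```

So the statistics are shared within an example, while gamma/beta are per-feature — and it's the second part that conflicts with my reading of the paper.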
However, in the official doc of nn.LayerNorm, a note says:
"Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias with elementwise_affine."
I find this confusing. Is there a mistake in the doc, is the implementation truly per-element as the note says, or is my understanding of layer normalization wrong? (The paper is at https://arxiv.org/abs/1607.06450.)
Can anyone give me some suggestions?