As I understand it, layer normalization normalizes across the features (elements) of one example, so all the elements in that example should

(1) use the same mean and variance, computed over the example's own elements;
(2) be scaled and shifted by the same parameters gamma and beta,

i.e. different elements in one example should share the same normalization parameters rather than each element having its own (not per-element).

However, in the official doc of nn.LayerNorm, a note says

"Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias with elementwise_affine "

I think this is confusing… I wonder if there is a mistake in the doc or the implementation is truly as the note says, or if my understanding of layer normalization is wrong. (paper is at https://arxiv.org/abs/1607.06450)

Thanks for your correction. So is the following understanding correct?
Assume the input minibatch has size (N, C, H, W). For each example, the elements in all channels (CHW elements) are used to compute the mean and variance that normalize that example, and there are NCHW scale and bias parameters, one for each element in each channel of each example.

Mostly correct! But there are only CHW scale and bias parameters because N is the batch dimension.
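To make the parameter count concrete, here is a minimal sketch that assumes normalization over the last 3 dimensions. The sizes N, C, H, W are arbitrary values chosen only for illustration.

```python
import torch
import torch.nn as nn

N, C, H, W = 4, 3, 5, 5  # arbitrary sizes, for illustration only
x = torch.randn(N, C, H, W)

# Normalize over the last 3 dimensions (C, H, W) of each example.
ln = nn.LayerNorm([C, H, W])

# One scale and one bias per normalized element: C*H*W each, independent of N.
print(ln.weight.shape)  # torch.Size([3, 5, 5])
print(ln.bias.shape)    # torch.Size([3, 5, 5])

y = ln(x)

# At initialization weight is all ones and bias all zeros, so each example
# comes out with approximately zero mean and unit variance over its CHW elements.
per_example_mean = y.mean(dim=(1, 2, 3))
per_example_var = ((y - per_example_mean.view(N, 1, 1, 1)) ** 2).mean(dim=(1, 2, 3))
print(per_example_mean.shape)  # torch.Size([4]), i.e. one statistic per example
```

The batch dimension N only determines how many means and variances are computed; it never appears in the parameter shapes.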

This is all based on the assumption that you normalize over the last 3 dimensions. Actually in channeled data, it is unclear whether you should normalize over the channel dimension or not.

So if I do not normalize over the channel dimension, I would:
(1) normalize over the HW elements of each channel of each example in a minibatch (i.e. compute NC mean and variance values)
(2) use separate scale and bias parameters for each element of each channel, shared by every example in the minibatch (i.e. CHW scale and bias parameters)
Am I correct?

No, you would have HW scale and bias parameters. Basically the number of such parameters is equal to the number of elements you normalized over (in each computation of mean and variance).
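As a rough check of this, the sketch below normalizes over only the last 2 dimensions and compares nn.LayerNorm against a manual computation. The sizes and the random weight and bias values are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

N, C, H, W = 4, 3, 5, 5  # arbitrary sizes, for illustration only
x = torch.randn(N, C, H, W)

# Normalize over (H, W) only, so the channel dimension is NOT normalized over.
ln = nn.LayerNorm([H, W])
print(ln.weight.shape)  # torch.Size([5, 5]): H*W parameters, not C*H*W

# Give the parameters non-trivial values so the comparison is meaningful.
with torch.no_grad():
    ln.weight.uniform_(0.5, 1.5)
    ln.bias.uniform_(-0.5, 0.5)

# Manual version: one mean and variance per (example, channel) pair, N*C in total.
mean = x.mean(dim=(2, 3), keepdim=True)                # shape (N, C, 1, 1)
var = ((x - mean) ** 2).mean(dim=(2, 3), keepdim=True)  # biased variance
x_hat = (x - mean) / torch.sqrt(var + ln.eps)

# The (H, W) weight and bias broadcast over N and C, so every channel
# of every example reuses the same H*W scale and bias values.
y_manual = x_hat * ln.weight + ln.bias

y = ln(x)
print(torch.allclose(y, y_manual, atol=1e-5))
```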

If I use HW scale and bias parameters, the same location in different channels will share the same scale and bias parameters, is that right? In BatchNorm, by contrast, different channels use different scale and bias parameters.

It’s unclear how it should be applied on a CNN, so we provide a lot of flexibility and you can normalize over either C or not. But in either case, I think the paper is pretty clear in that you have a scale and bias for each normalized element.
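For a side-by-side view, this small sketch prints the affine parameter shapes for the options discussed in this thread. The sizes C, H, W are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

C, H, W = 3, 5, 5  # arbitrary sizes, for illustration only

bn = nn.BatchNorm2d(C)            # one scalar scale and bias per channel
ln_chw = nn.LayerNorm([C, H, W])  # normalize over C, H, W
ln_hw = nn.LayerNorm([H, W])      # normalize over H, W only

print(bn.weight.shape)      # torch.Size([3])
print(ln_chw.weight.shape)  # torch.Size([3, 5, 5])
print(ln_hw.weight.shape)   # torch.Size([5, 5])

# Number of scale parameters in each case.
counts = {
    "BatchNorm2d(C)": bn.weight.numel(),
    "LayerNorm([C, H, W])": ln_chw.weight.numel(),
    "LayerNorm([H, W])": ln_hw.weight.numel(),
}
print(counts)
```

In the LayerNorm([H, W]) case the parameters carry no channel dimension, so the same spatial location in different channels does share one scale and one bias, which is the opposite of what BatchNorm2d does.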