PyTorch doc about LayerNormalization is confusing

gdp · October 10, 2018, 4:39pm

According to my understanding, layer normalization is to normalize across the features (elements) of one example, so all the elements in that example should

(1) use the same mean and variance computed over the example’s elements themselves.
(2) be scaled and shifted by the same parameters gamma and beta.

That is, different elements in one example should share the same normalization parameters rather than each element having its own.

However, in the official doc of nn.LayerNorm, a note says

"Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias with elementwise_affine "

I think this is confusing… I wonder if there is a mistake in the doc or the implementation is truly as the note says, or if my understanding of layer normalization is wrong. (paper is at https://arxiv.org/abs/1607.06450)

Can anyone give me some suggestions ?

SimonW · October 10, 2018, 5:09pm

PyTorch LN is correct. From Sec. 5.1 of the original paper:

"They also learn an adaptive bias b and gain g for each neuron after the normalization."

gdp · October 10, 2018, 5:25pm

Thanks for your correction. So is the following understanding correct?
Assume the input minibatch has size (N, C, H, W). For each example, the elements in all channels (CHW elements) are used to compute the mean and variance that normalize that example, and there are NCHW scale and bias parameters, one for each element in each channel of each example.

SimonW · October 10, 2018, 6:18pm

Mostly correct! But there are only CHW scale and bias parameters because N is the batch dimension.

This is all based on the assumption that you normalize over the last 3 dimensions. Actually in channeled data, it is unclear whether you should normalize over the channel dimension or not.
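
A minimal sketch of the case described above, normalizing over the last three dimensions. The sizes are illustrative assumptions, not values from the thread. It checks that nn.LayerNorm([C, H, W]) holds CHW gains and biases (no N), and that a manual computation with one mean and variance per example matches the module's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 4, 3, 5, 5  # illustrative sizes (assumption)
x = torch.randn(N, C, H, W)

# Normalize over the last three dimensions (C, H, W) of each example.
ln = nn.LayerNorm([C, H, W])
weight_shape = tuple(ln.weight.shape)  # one gain per normalized element
bias_shape = tuple(ln.bias.shape)      # one bias per normalized element

# Manual version: one mean/variance per example, then the per-element affine.
mean = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

stats_shape = tuple(mean.shape)
max_diff = (manual - ln(x)).abs().max().item()
print(weight_shape, bias_shape, stats_shape, max_diff)
```

The parameters broadcast over the batch dimension, which is why their shape carries no N.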

gdp · October 11, 2018, 6:10am

So if I do not normalize over channel dimension, I would:
(1) normalize over HW elements for each channel of each example in a minibatch (i.e. compute NC mean and variance values)
(2) use scale and bias parameters for each element in the same channel of every example in the minibatch (i.e. CHW scale and bias parameters)
Am I correct?

SimonW · October 11, 2018, 7:14am

No, you would have HW scale and bias parameters. Basically the number of such parameters is equal to the number of elements you normalized over (in each computation of mean and variance).
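
The rule above can be checked with a short sketch: the number of gains in nn.LayerNorm equals the number of elements in normalized_shape, which always covers trailing dimensions of the input. The sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

N, C, H, W = 4, 3, 5, 5  # illustrative sizes (assumption)
x = torch.randn(N, C, H, W)

# Count the gains for a few choices of normalized_shape.
param_counts = {}
for normalized_shape in [(W,), (H, W), (C, H, W)]:
    ln = nn.LayerNorm(normalized_shape)
    param_counts[normalized_shape] = ln.weight.numel()

# Without the channel dimension: one mean per (example, channel) pair.
stats_shape = tuple(x.mean(dim=(2, 3)).shape)
print(param_counts, stats_shape)
```

With normalized_shape (H, W) there are N*C means and variances but only H*W gains and biases.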

gdp · October 11, 2018, 7:33am

If I use HW scale and bias parameters, the same location in different channels will share the same scale and bias parameters, is that right? In BatchNorm, by contrast, different channels use different scale and bias parameters.

SimonW · October 11, 2018, 7:40am

Your understanding is correct.
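
A small sketch of that sharing pattern, under assumed sizes. The gains are overwritten with arbitrary values only to make the pattern visible (the defaults are all ones). It broadcasts each gain to the shape of one example and compares channels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 2, 3, 4, 4  # illustrative sizes (assumption)
x = torch.randn(N, C, H, W)

ln = nn.LayerNorm([H, W])  # normalizes over H and W only
bn = nn.BatchNorm2d(C)

# Give the gains non-trivial values so that sharing is visible.
with torch.no_grad():
    ln.weight.copy_(torch.randn(H, W))
    bn.weight.copy_(torch.arange(1.0, C + 1))

# Broadcast each gain to the shape of one example, (C, H, W).
ln_gain = ln.weight.expand(C, H, W)
bn_gain = bn.weight[:, None, None].expand(C, H, W)

# LayerNorm: the same (h, w) location has the same gain in every channel.
ln_shared_across_channels = torch.equal(ln_gain[0], ln_gain[1])
# BatchNorm: one gain per channel, constant over all locations in it.
bn_constant_within_channel = bool((bn_gain[1] == bn_gain[1, 0, 0]).all())
bn_differs_across_channels = not torch.equal(bn_gain[0], bn_gain[1])

# Check that the broadcast LayerNorm gain reproduces the module's output.
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln_gain + ln.bias
ln_diff = (manual - ln(x)).abs().max().item()
```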

gdp · October 11, 2018, 7:52am

Oh thanks a lot ! But the LayerNorm paper did not specify its usage in CNN, is the above usage a convention?

SimonW · October 11, 2018, 8:10am

It’s unclear how it should be applied on a CNN, so we provide a lot of flexibility and you can normalize over either C or not. But in either case, I think the paper is pretty clear in that you have a scale and bias for each normalized element.

gdp · October 11, 2018, 8:18am

OK, got it. Thanks !

ado_sar · July 5, 2024, 7:16pm

Still I can’t understand from the docs (emphasis mine):

Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias with elementwise_affine

In Batch Norm an adaptive bias b and gain g is also learned for each neuron after the normalization. In other words, Batch Norm takes an input vector x, multiplies it element-wise with g and then adds the bias vector b. Doesn’t layer norm do the same thing with regards to the affine part?

ptrblck · July 5, 2024, 8:52pm

After broadcasting the .weight and .bias as seen here:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)
bn = nn.BatchNorm2d(3)

print(bn.weight)
# Parameter containing:
# tensor([1., 1., 1.], requires_grad=True)
print(bn.bias)
# Parameter containing:
# tensor([0., 0., 0.], requires_grad=True)
ref = bn(x)

# manual approach
out = (x - x.mean([0, 2, 3], keepdim=True)) / torch.sqrt(x.var([0, 2, 3], unbiased=False, keepdim=True) + bn.eps)
out = out * bn.weight[None, :, None, None] + bn.bias[None, :, None, None]

print((out - ref).abs().max())
# tensor(4.7684e-07, grad_fn=<MaxBackward1>)

since both trainable parameters have num_channels values.

ado_sar · July 5, 2024, 9:14pm

Apologies if I misunderstood.

Batch Norm takes an input vector x, multiplies it element-wise with g and then adds the bias vector b.

I am referring to the original papers of BatchNorm and LayerNorm where they consider simple feed-forward networks (MLPs). The only difference is how the mean and the variance are calculated. That is, assuming a batched input X of shape (N, n_feats), then BatchNorm calculates the mean (same for variance) as X.mean(dim=0) while LayerNorm calculates it as X.mean(dim=1).

Other than that, they both learn n_feats gains and biases. Isn’t this correct?
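
For the two-dimensional case described in this post, a quick check along these lines may help. It is a sketch with assumed sizes, using nn.BatchNorm1d in its default training mode so that batch statistics are used. It compares the parameter shapes of the two modules and reproduces each output manually with the mean and variance taken over dim=0 and dim=1 respectively.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, n_feats = 8, 16  # illustrative sizes (assumption)
X = torch.randn(N, n_feats)

bn = nn.BatchNorm1d(n_feats)  # training mode by default: uses batch statistics
ln = nn.LayerNorm(n_feats)

bn_param_shape = tuple(bn.weight.shape)
ln_param_shape = tuple(ln.weight.shape)

# BatchNorm: statistics over the batch dimension, one per feature.
bn_manual = (X - X.mean(dim=0)) / torch.sqrt(X.var(dim=0, unbiased=False) + bn.eps)
bn_manual = bn_manual * bn.weight + bn.bias

# LayerNorm: statistics over the feature dimension, one per example.
ln_mean = X.mean(dim=1, keepdim=True)
ln_var = X.var(dim=1, unbiased=False, keepdim=True)
ln_manual = (X - ln_mean) / torch.sqrt(ln_var + ln.eps) * ln.weight + ln.bias

bn_diff = (bn_manual - bn(X)).abs().max().item()
ln_diff = (ln_manual - ln(X)).abs().max().item()
print(bn_param_shape, ln_param_shape, bn_diff, ln_diff)
```

In this (N, n_feats) setting both modules hold n_feats gains and biases. The parameter counts diverge for (N, C, H, W) inputs, where nn.BatchNorm2d keeps C values while nn.LayerNorm keeps one value per normalized element, as discussed earlier in the thread.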