# GroupNorm(num_groups=1) and LayerNorm are not equivalent?

Can anyone help me understand how GroupNorm(num_groups=1) and LayerNorm can be equivalent?

I tried the following code, modified from the original GroupNorm documentation, and found that the two functions are not equivalent (I checked the initialisation of both and it is the same):

```
>>> import torch
>>> import torch.nn as nn
>>>
>>> input = torch.randn(20, 6, 10, 10)
>>> g = nn.GroupNorm(1, 6)
>>> l = nn.LayerNorm(6)
>>> print((g(input) - l(input.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)).norm())
```

Then I tried another snippet:

```
>>> import torch
>>> import torch.nn as nn
>>>
>>> input = torch.randn(20, 6, 10, 10)
>>> g = nn.GroupNorm(1, 6)
>>> l = nn.LayerNorm((6, 10, 10))
>>> print((g(input) - l(input)).norm())
>>> print(g.weight.shape)
torch.Size([6])
>>> print(l.weight.shape)
torch.Size([6, 10, 10])
```

Although this gives the same normalised result, you can see that the shapes of the learnable parameters of the two modules are not the same at all.

So I am wondering: in what sense are GroupNorm(num_groups=1) and LayerNorm equivalent? Am I missing something here?

Hmm… Let’s do some checks.

```
import torch
from torch import nn

x = torch.randn(20, 6, 10, 10)
g = nn.GroupNorm(1, 6, affine=False)
l = nn.LayerNorm((6, 10, 10), elementwise_affine=False)

y_g = g(x)
y_l = l(x)
(y_g - y_l).pow(2).sum().sqrt()
# tensor(5.1145e-06)
```
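So with the affine parameters disabled, the two modules agree: both normalise over all of (C, H, W) per sample. A quick sketch reproducing both by hand (using the default `eps=1e-5` of both modules):

```python
import torch
import torch.nn as nn

x = torch.randn(20, 6, 10, 10)

# Per-sample statistics over all of (C, H, W):
mean = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)  # biased variance, as both modules use
y_manual = (x - mean) / torch.sqrt(var + 1e-5)

g = nn.GroupNorm(1, 6, affine=False)
l = nn.LayerNorm((6, 10, 10), elementwise_affine=False)
assert torch.allclose(g(x), y_manual, atol=1e-5)
assert torch.allclose(l(x), y_manual, atol=1e-5)
```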

Now, when you type `nn.LayerNorm(6)`, you’re instructing `torch` to compute the normalisation over a single dimension, i.e. the last one. So when you feed the permuted input to your `LayerNorm` module, it computes a normalisation only over the 6 channels, separately at every spatial location. No wonder you got a difference of `55.1597` in your first snippet.
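You can confirm what `nn.LayerNorm(6)` actually computes with a small sketch (affine disabled for clarity): it matches a per-location normalisation over the channel dimension only.

```python
import torch
import torch.nn as nn

x = torch.randn(20, 6, 10, 10)
l = nn.LayerNorm(6, elementwise_affine=False)

# LayerNorm(6) on the permuted input normalises over the 6 channels
# independently at every spatial location...
y = l(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

# ...which is the same as taking per-location statistics over dim=1:
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + 1e-5)
assert torch.allclose(y, y_manual, atol=1e-5)
```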

Finally, `GroupNorm` uses a (global) channel-wise learnable scale and bias, while `LayerNorm` has a (local) scale and bias for each location as well. Unless you tie them across all locations, `LayerNorm` is strictly more flexible than `GroupNorm` with a single group. You can see how their C++ implementations differ below.

`group_norm_kernel.cpp`

```
// global scale and bias
for (const auto k : c10::irange(HxW)) {
  Y_ptr[k] = scale * X_ptr[k] + bias;
}
```

`layer_norm_kernel.cpp`

```
// per-location scale and bias
vec::map3<T>(
    [scale, bias](Vec x, Vec gamma, Vec beta) {
      return (x * Vec(scale) + Vec(bias)) * gamma + beta;
    },
    Y_ptr,
    X_ptr,
    gamma_data,
    beta_data,
    N);
```
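The same difference can be sketched in plain PyTorch (the names below are illustrative, not the kernels’ actual variables): after normalisation, `GroupNorm` applies one (scale, bias) pair per channel, broadcast over H×W, while `LayerNorm` applies a separate (gamma, beta) pair at every element of (C, H, W).

```python
import torch

N, C, H, W = 20, 6, 10, 10
x_hat = torch.randn(N, C, H, W)  # stand-in for an already-normalised tensor

# GroupNorm-style affine: per-channel parameters of shape (C,),
# broadcast over the spatial dimensions
g_weight = torch.randn(C)
g_bias = torch.randn(C)
y_gn = x_hat * g_weight.view(1, C, 1, 1) + g_bias.view(1, C, 1, 1)

# LayerNorm-style affine: per-element parameters of shape (C, H, W)
l_weight = torch.randn(C, H, W)
l_bias = torch.randn(C, H, W)
y_ln = x_hat * l_weight + l_bias

# GroupNorm's affine is the special case where LayerNorm's parameters
# are constant across H and W:
tied = x_hat * g_weight.view(C, 1, 1).expand(C, H, W) \
     + g_bias.view(C, 1, 1).expand(C, H, W)
assert torch.allclose(y_gn, tied, atol=1e-5)
```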

Where `map` tells you that (citing Wikipedia):

> a simple operation is applied to all elements of a sequence, potentially in parallel. It is used to solve embarrassingly parallel problems: those problems that can be decomposed into independent subtasks, requiring no communication/synchronization between the subtasks

Answering your question: `GroupNorm(num_groups=1)` and `LayerNorm` are not equivalent, unless followed by a fully-connected layer.
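One way to see the "special case" relationship concretely: if you copy `GroupNorm`’s per-channel parameters into `LayerNorm`’s per-location parameters (broadcast across H and W), the two modules do agree exactly. A sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(20, 6, 10, 10)
g = nn.GroupNorm(1, 6)
l = nn.LayerNorm((6, 10, 10))

# Give GroupNorm non-trivial affine parameters, then tie LayerNorm's
# per-location parameters to the same per-channel values:
with torch.no_grad():
    g.weight.copy_(torch.randn(6))
    g.bias.copy_(torch.randn(6))
    l.weight.copy_(g.weight.view(6, 1, 1).expand(6, 10, 10))
    l.bias.copy_(g.bias.view(6, 1, 1).expand(6, 10, 10))

assert torch.allclose(g(x), l(x), atol=1e-4)
```

So the normalisation itself is identical; it is only the extra per-location affine freedom that makes `LayerNorm` the strictly more general module.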