Can anyone help me understand how GroupNorm(num_groups=1) and LayerNorm can be equivalent?
I tried the following code modified from the original GN document link and I found the two functions are not equivent: (I check the initialization for both functions and they are the same)
>>> import torch
>>> import torch.nn as nn
>>>
>>> input = torch.randn(20, 6, 10, 10)
>>> g = nn.GroupNorm(1, 6)
>>> l = nn.LayerNorm(6)
>>> print((g(input) - l(input.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)).norm())
tensor(55.1597, grad_fn=<CopyBackwards>)
Then I tried another code:
>>> import torch
>>> import torch.nn as nn
>>>
>>> input = torch.randn(20, 6, 10, 10)
>>> g = nn.GroupNorm(1, 6)
>>> l = nn.LayerNorm((6, 10, 10))
>>> print((g(input) - l(input)).norm())
tensor(0., grad_fn=<CopyBackwards>)
>>> print(g.weight.shape)
torch.Size([6])
>>> print(l.weight.shape)
torch.Size([6, 10, 10])
Although it gives the same result, you can see that the dimension of the parameters of the two functions are not the same at all.
So I am wondering, in what sense, the GroupNorm(num_groups=1) and LayerNorm are equivalent? Do I miss something here?