LayerNorm Misunderstanding

Hey guys! Out of interest, I wanted to reimplement the nn.LayerNorm functionality, but I can't wrap my head around a dummy example: I expected ref and out below to produce the same result. For context, the embedding is supposed to be a single sentence (batch_size = 1) with two words, and each word embedding has dimension two. Thanks a lot!

import torch
import torch.nn as nn

# one sentence (batch_size = 1), two words, embedding dimension 2
embedding = torch.FloatTensor([[[2, 1], [3, 4]]])

# normalized_shape = [2], i.e. normalize over the last dimension
layer_norm = nn.LayerNorm([2])
ref = layer_norm(embedding)

# manual attempt: statistics taken over dims (1, 2), i.e. over all four values
mean = embedding.mean(dim=(1, 2))
std = embedding.std(dim=(1, 2))
out = (embedding - mean) / (std + 1e-5)

out, ref
=>
(tensor([[[-0.3873, -1.1619],
          [ 0.3873,  1.1619]]]),
 tensor([[[ 1.0000, -1.0000],
          [-1.0000,  1.0000]]], grad_fn=<NativeLayerNormBackward0>))

It seems like your script doesn't apply the affine transformation after the normalization (ref: LayerNorm — PyTorch 1.13 documentation).
I think it'd be necessary to initialize the weight and bias parameters (gamma and beta in the doc above) and apply them to out in your snippet, e.g. weight * out + bias.
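As a minimal sketch, reusing layer_norm and out from your snippet (weight and bias are the module's learnable gamma and beta, broadcast over the last dimension):

# apply the module's own affine parameters to the manually normalized tensor
affine_out = layer_norm.weight * out + layer_norm.bias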


The default values for the weights are all ones and for the biases all zeros, which is why I left them out. Or am I mistaken?
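For what it's worth, a quick check of the defaults (with elementwise_affine=True, which is the default, the parameters start out as ones and zeros):

layer_norm = nn.LayerNorm([2])
print(layer_norm.weight)  # Parameter containing: tensor([1., 1.], requires_grad=True)
print(layer_norm.bias)    # Parameter containing: tensor([0., 0.], requires_grad=True)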