Understanding torch.nn.LayerNorm in NLP

I'm trying to understand how torch.nn.LayerNorm works in an NLP model. Assume the input data is a batch of sequences of word embeddings:

import torch

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)
print("x: ", embedding)

layer_norm = torch.nn.LayerNorm(dim)
print("y: ", layer_norm(embedding))

# outputs:
"""
x:  tensor([[[ 0.5909,  0.1326,  0.8100,  0.7631],
         [ 0.5831, -1.7923, -0.1453, -0.6882],
         [ 1.1280,  1.6121, -1.2383,  0.2150]],

        [[-0.2128, -0.5246, -0.0511,  0.2798],
         [ 0.8254,  1.2262, -0.0252, -1.9972],
         [-0.6092, -0.4709, -0.8038, -1.2711]]])
y:  tensor([[[ 0.0626, -1.6495,  0.8810,  0.7060],
         [ 1.2621, -1.4789,  0.4216, -0.2048],
         [ 0.6437,  1.0897, -1.5360, -0.1973]],

        [[-0.2950, -1.3698,  0.2621,  1.4027],
         [ 0.6585,  0.9811, -0.0262, -1.6134],
         [ 0.5934,  1.0505, -0.0497, -1.5942]]],
       grad_fn=<NativeLayerNormBackward0>)
"""

From the documentation's description, my understanding is that the mean and std are computed over all embedding values per sample. So I tried to compute y[0, 0, :] manually:

mean = torch.mean(embedding[0, :, :])
std = torch.std(embedding[0, :, :])
print((embedding[0, 0, :] - mean) / std)

which gives tensor([ 0.4310, -0.0319, 0.6523, 0.6050]), and that's not the right output. What is the right way to compute y[0, 0, :]?

This should work:

import torch

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)
print("x: ", embedding)

layer_norm = torch.nn.LayerNorm(dim)
y = layer_norm(embedding)
print("y: ", y)

# normalise over the last dimension only, using the biased variance and adding eps
out = (embedding - torch.mean(embedding, dim=2, keepdim=True)) / torch.sqrt(
    torch.var(embedding, dim=2, keepdim=True, unbiased=False) + layer_norm.eps
)

print((out - y).abs().max())
# > tensor(1.1921e-07, grad_fn=<MaxBackward1>)
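
Applied to just the first token of the first sample, this reproduces the y[0, 0, :] the question asks about (a minimal sketch; LayerNorm's affine weight and bias are left out because they are initialised to ones and zeros, so they don't change the result here):

# statistics over the last dimension only, i.e. over this single embedding vector
v = embedding[0, 0, :]
mean = v.mean()
std = v.std(unbiased=False)  # biased estimator, which is what LayerNorm uses
print((v - mean) / std)      # matches y[0, 0, :] up to layer_norm.eps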

I found the problem: besides computing the statistics over the last dimension only (e.g. embedding[0, 0, :] for the first token), I need to add unbiased=False in torch.std.
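
For reference, the two estimators differ only by Bessel's correction; torch.std defaults to the unbiased one, while LayerNorm divides by the biased one (reusing the embedding from above):

v = embedding[0, 0, :]
print(v.std())                # unbiased by default: divides by (n - 1)
print(v.std(unbiased=False))  # biased: divides by n, matching LayerNorm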