I’m trying to understand how `torch.nn.LayerNorm` works in an NLP model. Assuming the input is a batch of sequences of word embeddings:

```
import torch

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)
print("x: ", embedding)
layer_norm = torch.nn.LayerNorm(dim)
print("y: ", layer_norm(embedding))
# outputs:
"""
x: tensor([[[ 0.5909, 0.1326, 0.8100, 0.7631],
[ 0.5831, -1.7923, -0.1453, -0.6882],
[ 1.1280, 1.6121, -1.2383, 0.2150]],
[[-0.2128, -0.5246, -0.0511, 0.2798],
[ 0.8254, 1.2262, -0.0252, -1.9972],
[-0.6092, -0.4709, -0.8038, -1.2711]]])
y: tensor([[[ 0.0626, -1.6495, 0.8810, 0.7060],
[ 1.2621, -1.4789, 0.4216, -0.2048],
[ 0.6437, 1.0897, -1.5360, -0.1973]],
[[-0.2950, -1.3698, 0.2621, 1.4027],
[ 0.6585, 0.9811, -0.0262, -1.6134],
[ 0.5934, 1.0505, -0.0497, -1.5942]]],
grad_fn=<NativeLayerNormBackward0>)
"""
```

From the documentation’s description, my understanding is that the mean and std are computed over all embedding values per sample. So I try to compute `y[0, 0, :]` manually:

```
mean = torch.mean(embedding[0, :, :])
std = torch.std(embedding[0, :, :])
print((embedding[0, 0, :] - mean) / std)
```

which gives `tensor([ 0.4310, -0.0319, 0.6523, 0.6050])`, and that’s not the right output. What is the right way to compute `y[0, 0, :]`?
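For reference, here is my current guess: maybe the statistics are taken over only the last dimension (each embedding vector separately), and with the biased (population) variance rather than `torch.std`'s default unbiased one. This is an assumption on my part, not something I've confirmed in the docs:

```python
import torch

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)
layer_norm = torch.nn.LayerNorm(dim)

# Hypothesis: mean/var over the last dimension only, biased variance,
# and eps added inside the square root (all assumptions to be verified).
mean = embedding.mean(dim=-1, keepdim=True)
var = embedding.var(dim=-1, unbiased=False, keepdim=True)
manual = (embedding - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(manual, layer_norm(embedding), atol=1e-6))
```

Is this per-vector normalization the correct interpretation?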