# NaN in layer normalization

I have noticed that if I use layer normalization in a small model I sometimes get a NaN in the gradient.
I think this is because the model ends up having zero variance.
I should mention that I'm experimenting with a really small model (5 hidden units), but I'm wondering if there is a more stable solution (adding an epsilon of 1e-6 does not solve my problem).

Cheers,
Sandro

How do you compute the normalization?

It shouldn't give you NaN if you divide by (std + epsilon).

1 Like
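A minimal sketch of that suggestion (the shapes and the eps value here are just illustrative):

```python
import torch

def layer_norm(x, eps=1e-6):
    # normalize over the last dimension, dividing by (std + eps)
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return (x - mean) / (std + eps)

torch.manual_seed(0)
x = torch.randn(2, 5)
y = layer_norm(x)
print(y.mean(-1))  # close to 0
print(y.std(-1))   # close to 1
```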

Yes, I do std + epsilon.

But I noticed that when I get a NaN, I get it from the first batch. So I'm wondering if this could be related to the fact that at the first batch the values are normally distributed (I'm using Xavier initialization), which means it might actually happen to have a small std.

Being normally distributed doesn't mean that it has a small stddev. Being standard normally distributed only implies a small mean; the stddev is 1.

1 Like
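A quick check of that point (seed and sample size are arbitrary): samples from a standard normal have a stddev near 1, not near 0.

```python
import torch

torch.manual_seed(0)
x = torch.randn(10000)  # standard normal samples
print(x.mean())  # near 0
print(x.std())   # near 1, not small
```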

But on a deeper look, I found out that I get NaN only when the hidden units are all 0. That means that both the mean and the std are 0. For some reason, if you try this:

```python
import torch

loss_fn = torch.nn.MSELoss()
x = torch.zeros(5, requires_grad=True)  # all hidden units are 0
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
r = (x - mean) / (std + 1e-6)
loss = loss_fn(r, torch.zeros(5))
loss.backward()
```

You get a None grad for `r`, which seems odd. Any intuition why this is happening?

Your `r` doesn't retain its gradient because it is a non-leaf tensor. You should add `r.retain_grad()` after computing `r`.
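To illustrate that point with a toy tensor (the values here are arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)
r = (x - x.mean()) / (x.std() + 1e-6)
r.retain_grad()   # r is a non-leaf tensor; its grad is discarded unless retained
r.sum().backward()
print(r.grad)     # populated instead of None
print(x.grad)     # leaf gradients are kept by default
```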

Thanks, but I still do not get why the gradient of x is NaN.
I understand that the gradient with respect to the mean can explode because it involves 1/(std + 1e-6), but the derivative with respect to x should be 1/N.

```python
import torch

loss_fn = torch.nn.MSELoss()
x = torch.zeros(5, requires_grad=True)  # all hidden units are 0
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
r = (x - mean) / (std + 1e-6)
loss = loss_fn(r, torch.zeros(5))
loss.backward()
```

This makes no sense. `loss.backward()` implicitly backwards a grad_output of 1.