NaN in layer normalization

I have noticed that if I use layer normalization in a small model, I sometimes get a NaN in the gradient.
I think this is because the model ends up with zero variance.
I should mention that I’m experimenting with a really small model (5 hidden units), but I’m wondering if there is a way to get a more stable solution (adding an epsilon of 1e-6 does not solve my problem).
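
For reference, here is roughly what I am computing (a minimal sketch, not my exact model code; the normalization is over the last dimension and the epsilon is added to the std):

import torch

# sketch of the layer normalization described above: zero mean / unit std
# over the last dimension, with a small epsilon for numerical stability
def layer_norm(x, eps=1e-6):
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return (x - mean) / (std + eps)

h = torch.randn(1, 5)  # 5 hidden units, as in my model
print(layer_norm(h))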

Cheers,
Sandro

How do you compute the normalization?

It shouldn’t give you NaN if you divide by (std + epsilon).


Yes, I do std + epsilon.

But I noticed that when I get a NaN, I get it from the first batch. So I’m wondering if this could be related to the fact that at the first batch the values are normally distributed (I’m using Xavier initialization), which means the std might actually turn out to be small.

Being normally distributed doesn’t mean that it has a small stddev. Being standard normally distributed only implies a small mean.
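
For example, a quick check with Xavier initialization (a sketch; depending on your PyTorch version the function is torch.nn.init.xavier_normal or xavier_normal_):

import torch
import torch.nn.init as init

# Xavier/Glorot normal init uses std = gain * sqrt(2 / (fan_in + fan_out)),
# so the values being normally distributed does not make the std tiny.
w = torch.zeros(5, 5)
init.xavier_normal_(w)  # on older releases: init.xavier_normal(w)
print(w.std())          # around sqrt(2 / 10) ≈ 0.45, nowhere near 0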


True. Thanks for the advice.
But from a deeper look, I found out that I get a NaN only when the hidden units are all 0. That means that both the mean and the std are 0. For some reason, if you try this:

import torch

loss_fn = torch.nn.MSELoss()
# all-zero input: mean and std over the last dimension are both 0
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
r = (x - mean) / (std + 1e-6)
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.backward()
r.grad  # comes back as None

You get a None grad, which seems odd. Any intuition about why this is happening?

Your r doesn’t retain_grad. You should add r.retain_grad() after computing r.

Thanks, but I still do not get why the gradient of x is NaN.
I understand that the gradient of the mean can explode because it is 1/(std + 1e-6), but the derivative with respect to x should be 1/N.

import torch

loss_fn = torch.nn.MSELoss()
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)
mean = x.mean(-1, keepdim=True)
mean.retain_grad()  # keep grads of the intermediate Variables for inspection
std = x.std(-1, keepdim=True)
std.retain_grad()
r = (x - mean) / (std + 1e-6)
r.retain_grad()
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.retain_grad()
loss.backward()
print(loss.grad)

That makes no sense; loss.backward() implicitly backpropagates a grad_output of 1.

Sorry, I didn’t read the entire thread when I first replied. This is a bug; see torch.std NaN gradient · Issue #4320 · pytorch/pytorch · GitHub.
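
For anyone who hits this later, a minimal reproduction of that behaviour (a sketch, using the same Variable-style API as the snippets above):

import torch

# the gradient of std w.r.t. x is (x - mean) / ((N - 1) * std),
# which is 0/0 = NaN when all elements are equal (zero variance)
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)
std = x.std(-1, keepdim=True)
std.sum().backward()
print(x.grad)  # all NaN, before any epsilon is even involved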


Yes, you are right. I just needed an extra line to stop the debugger.

Thanks so much. I hoped they had fixed it in the 0.4.0a0+94f439c release, but apparently not yet.

Yeah… I’ll keep that on my list of things to fix. But that list is getting quite long. I’ll try my best to fix it soon.