I have noticed that if I use layer normalization in a small model I can sometimes get a NaN in the gradient.
I think this is because the model ends up having zero variance.
I should mention that I'm experimenting with a really small model (5 hidden units), but I'm wondering if there is a way to get a more stable solution (adding an epsilon of 1e-6 does not solve my problem).
But I noticed that when I do get a NaN, I get it from the first batch. So I'm wondering if this could be related to the fact that at the first batch the values are normally distributed (I'm using Xavier initialization), which means the per-row std might actually happen to be very small.
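As a rough probe of that idea (hypothetical layer sizes, not my actual model): a 5-unit Xavier-initialised linear layer fed a random first batch, printing the smallest per-row std, i.e. the quantity layer norm would divide by.

import torch
# Hypothetical sizes, just to probe how small the per-row std of a
# 5-unit layer's first-batch outputs can get with Xavier initialization.
torch.manual_seed(0)
layer = torch.nn.Linear(20, 5)
torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)
h = layer(torch.randn(256, 20))
print(h.std(-1).min())   # smallest per-row std in the batch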
True. Thanks for the advice.
But from a deeper look, I found out that I get a NaN only when the hidden units are all 0, which means that both the mean and the std are 0. For some reason, if you try this:
import torch
loss_fn = torch.nn.MSELoss()
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)  # all hidden units are 0
mean = x.mean(-1, keepdim=True)   # 0
std = x.std(-1, keepdim=True)     # 0, since every element is identical
r = (x - mean) / (std + 1e-6)     # layer-norm style normalization
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.backward()
print(r.grad)                     # None
You get a None grad, which seems odd. Any intuition about why this is happening?
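A quick note on the None part, which is separate from the NaN: by default autograd only populates .grad on leaf tensors, so r.grad stays None unless r.retain_grad() is called before backward (which is what the longer snippet further down does). A minimal illustration:

import torch
x = torch.zeros(1, 5, requires_grad=True)   # leaf tensor: .grad is populated
r = x * 2.0                                  # non-leaf tensor: .grad is not kept by default
r.retain_grad()                              # ask autograd to keep r's gradient
r.sum().backward()
print(x.grad)   # tensor([[2., 2., 2., 2., 2.]])
print(r.grad)   # tensor([[1., 1., 1., 1., 1.]]) -- would be None without retain_grad()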
Thanks, but I still do not get why the gradient of x is NaN.
I understand that the gradient of the mean can explode, because it is 1/(std + 1e-6), but the derivative of the mean with respect to x should only be 1/N.
import torch
loss_fn = torch.nn.MSELoss()
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)
mean = x.mean(-1, keepdim=True)
mean.retain_grad()                 # keep .grad on these non-leaf tensors
std = x.std(-1, keepdim=True)
std.retain_grad()
r = (x - mean) / (std + 1e-6)
r.retain_grad()
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.retain_grad()
loss.backward()
print(loss.grad)   # tensor(1.)
print(r.grad)      # -0.4 everywhere -- finite
print(std.grad)    # tensor([[0.]]) -- finite
print(mean.grad)   # about 2e6 -- large but finite
print(x.grad)      # all NaN -- this is where it breaks
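To isolate where the NaN actually enters: the gradients retained on r, std and mean all come out finite; the NaN shows up only in x.grad, because the backward of std itself computes (x - mean) / ((N - 1) * std), which is 0/0 when every element of x is 0. A minimal check, stripped of everything else:

import torch
x = torch.zeros(5, requires_grad=True)   # five identical (zero) hidden units
s = x.std()                              # std of a constant vector is 0
s.backward()
print(s)        # tensor(0., grad_fn=...)
print(x.grad)   # tensor([nan, nan, nan, nan, nan]) -- the 0/0 inside std's backward

This also explains why adding 1e-6 to std does not help: the division by zero has already happened inside std's backward. The built-in torch.nn.LayerNorm avoids it by putting the epsilon inside the square root, normalizing by sqrt(var + eps) rather than by (std + eps), which keeps the gradient finite on a zero-variance row.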