I have noticed that if I use layer normalization in a small model I can sometimes get a NaN in the gradient.
I think this is because the model ends up having zero variance.
I should mention that I'm experimenting with a really small model (5 hidden units), but I'm wondering if there is a way to get a more stable solution (adding an epsilon of 1e-6 does not solve my problem).
But I noticed that when I do get a NaN, I get it from the first batch. So I'm wondering if this could be related to the fact that at the first batch the values are normally distributed (I'm using Xavier initialization), which means the per-row std might actually happen to be very small.
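As a rough probe of that idea (hypothetical layer sizes, not my actual model): a 5-unit Xavier-initialised linear layer fed a random first batch, printing the smallest per-row std, i.e. the quantity layer norm would divide by.

import torch
# Hypothetical sizes, just to probe how small the per-row std of a
# 5-unit layer's first-batch outputs can get with Xavier initialization.
torch.manual_seed(0)
layer = torch.nn.Linear(20, 5)
torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)
h = layer(torch.randn(256, 20))
print(h.std(-1).min())   # smallest per-row std in the batch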
True. Thanks for the advice.
But from a deeper look, I found out that I get a NaN only when the hidden units are all 0, which means that both the mean and the std are 0. For some reason, if you try this:
import torch
loss_fn = torch.nn.MSELoss()
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)  # all hidden units are 0
mean = x.mean(-1, keepdim=True)   # 0
std = x.std(-1, keepdim=True)     # 0, since every element is identical
r = (x - mean) / (std + 1e-6)     # layer-norm style normalization
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.backward()
print(r.grad)                     # None
You get a None grad, which seems odd. Any intuition about why this is happening?
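A quick note on the None part, which is separate from the NaN: by default autograd only populates .grad on leaf tensors, so r.grad stays None unless r.retain_grad() is called before backward (which is what the longer snippet further down does). A minimal illustration:

import torch
x = torch.zeros(1, 5, requires_grad=True)   # leaf tensor: .grad is populated
r = x * 2.0                                  # non-leaf tensor: .grad is not kept by default
r.retain_grad()                              # ask autograd to keep r's gradient
r.sum().backward()
print(x.grad)   # tensor([[2., 2., 2., 2., 2.]])
print(r.grad)   # tensor([[1., 1., 1., 1., 1.]]) -- would be None without retain_grad()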
Thanks, but I still do not get why the gradient of x is NaN.
I understand that the gradient of the mean can explode, because it is 1/(std + 1e-6), but the derivative of the mean with respect to x should only be 1/N.
import torch
loss_fn = torch.nn.MSELoss()
x = torch.autograd.Variable(torch.zeros(1, 5), requires_grad=True)
mean = x.mean(-1, keepdim=True)
mean.retain_grad()                 # keep .grad on these non-leaf tensors
std = x.std(-1, keepdim=True)
std.retain_grad()
r = (x - mean) / (std + 1e-6)
r.retain_grad()
loss = loss_fn(r, torch.autograd.Variable(torch.ones(1, 5)))
loss.retain_grad()
loss.backward()
print(loss.grad)   # tensor(1.)
print(r.grad)      # -0.4 everywhere -- finite
print(std.grad)    # tensor([[0.]]) -- finite
print(mean.grad)   # about 2e6 -- large but finite
print(x.grad)      # all NaN -- this is where it breaks
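To isolate where the NaN actually enters: the gradients retained on r, std and mean all come out finite; the NaN shows up only in x.grad, because the backward of std itself computes (x - mean) / ((N - 1) * std), which is 0/0 when every element of x is 0. A minimal check, stripped of everything else:

import torch
x = torch.zeros(5, requires_grad=True)   # five identical (zero) hidden units
s = x.std()                              # std of a constant vector is 0
s.backward()
print(s)        # tensor(0., grad_fn=...)
print(x.grad)   # tensor([nan, nan, nan, nan, nan]) -- the 0/0 inside std's backward

This also explains why adding 1e-6 to std does not help: the division by zero has already happened inside std's backward. The built-in torch.nn.LayerNorm avoids it by putting the epsilon inside the square root, normalizing by sqrt(var + eps) rather than by (std + eps), which keeps the gradient finite on a zero-variance row.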