[Solved] Debugging NaNs in gradients

Hi there!

I’ve been training a model and keep running into problems during backpropagation. After calling backward() on the loss, the gradients become NaN at some point. I am aware that PyTorch 0.2.0 has a problem where the gradient at zero becomes NaN (see issue #2421 or some posts in this forum). I have therefore modified reduce.py as indicated in commit #2775 (I somehow cannot build everything from source). This means that when I run the code

x = Variable(torch.zeros(1), requires_grad=True)
out = x * 3
out.backward()
print(x.grad)  # gradient of out with respect to x

I get the result 3.

But even with this, the problem is not disappearing. I have monitored the weights, and they seem all good until the NaN in the gradient appears (I have checked the magnitude of each of the weights individually). So I was wondering, is there a way I can “quickly” identify what is the source of the NaN in the gradient other than keeping track of every single computation in the model?

Thank you in advance.


The issue you linked does not apply to your code snippet. It concerns the norm operation on a zero tensor specifically, not arbitrary operations. Your code snippet works perfectly fine on a vanilla 0.2 install.

To debug NaN gradients, you can add a backward hook at each step of your network and print the gradients to see where they become NaN.
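
A minimal sketch of the hook idea, assuming a recent PyTorch release (where `Tensor.register_hook` and `torch.isnan` are available; the 0.2-era `Variable` API differs). The helper `watch`, the list `nan_report`, and the toy computation are hypothetical and only illustrate the pattern: each hook fires during `backward()` and records whether the gradient arriving at that tensor contains NaN.

```python
import torch

nan_report = []  # names of tensors whose gradient contained NaN

def watch(tensor, name):
    """Hypothetical helper: record `name` if the gradient w.r.t. `tensor` has NaN."""
    def hook(grad):
        if torch.isnan(grad).any():
            nan_report.append(name)
    tensor.register_hook(hook)
    return tensor

# Toy forward pass that overflows in float32: exp(100) evaluates to inf.
x = watch(torch.tensor([0.0, 100.0], requires_grad=True), "x")
e = watch(x.exp(), "exp")
s = watch(e.add(1), "add")
y = watch(s.log(), "log")
y.sum().backward()

print(nan_report)
```

Because the backward pass runs from the output toward the inputs, the first name recorded marks the earliest point where NaN shows up; the operation that consumes that tensor in the forward pass is the one to inspect.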


Thanks for your answer Simon. This is strange, since I installed pytorch 0.2.0 from anaconda, ran the snippet, got NaN, updated reduce.py, ran the snippet again and got 3. Anyway, that is not my main concern actually. Thanks for your suggestion, I’ll try it and report back :slight_smile:


Hmm I tried running the code before posting and it worked… That is weird.

Let me know if you have new updates.

So, to give a bit more context, I think I can show a MWE. Essentially I’m running the RBM in https://github.com/odie2630463/Restricted-Boltzmann-Machines-in-pytorch, with just one fundamental modification. To compute free_energy one exponentiates wx_b, adds 1, and takes the log. If wx_b is too large, the exponentiation gives inf, which stays inf after adding 1 and taking the log, while the result should be negligibly close to wx_b. To avoid this, I modified free_energy as follows:

def free_energy(self,v):
    vbias_term = v.mv(self.v_bias)
    wx_b = F.linear(v,self.W,self.h_bias)
    hidden_term = wx_b.exp().add(1).log()
    if (hidden_term == np.inf).sum().data.cpu().numpy() != 0:
        hidden_term[hidden_term == np.inf] = wx_b[hidden_term == np.inf]
    hidden_term = hidden_term.sum(1)
    return (-hidden_term - vbias_term).mean()

Also, a minor modification was to substitute the optimizer by Adam and set the learning rate to 5e-2. Anyway, in my more complicated model SGD is giving the NaN in the gradients as well. Also I modified the number of units in the hidden layer to 50.

Even with this fix, I see a NaN appearing in the gradients at some specific point in the training. However, neither the input values, nor the intermediate calculations, nor the value of the loss function the gradients are calculated from seem to have problems in the step just before the NaN gradient appears.

In case it is useful, I have observed that the first time I get a NaN, it appears simultaneously in just one gradient for the biases of the hidden units (i.e., in one cell of params[2].grad) and in one full column for the weights (in params[0].grad), namely the column with the same index as the NaN hidden unit bias gradient. That is, I’m getting NaNs simultaneously in the gradients dL/d(hbias_i) and dL/d(W_{i j}) for just one specific i.

Any suggestion will be much appreciated, this is already driving me crazy…
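
One way to read the pattern described above, assuming wx_b_i = sum_j W_{i j} v_j + hbias_i as in the F.linear call, and writing delta_i for the gradient of the loss with respect to wx_b_i (a symbol introduced here for illustration, not taken from the post). For a single sample, both affected gradients share the factor delta_i:

```latex
\delta_i = \frac{\partial L}{\partial (\mathrm{wx\_b})_i},
\qquad
\frac{\partial L}{\partial \mathrm{hbias}_i} = \delta_i,
\qquad
\frac{\partial L}{\partial W_{ij}} = \delta_i \, v_j .
```

Over a batch these terms are summed across samples. A single NaN in delta_i would therefore give exactly one NaN bias gradient and NaN in every W_{i j} for that same i (NaN times any v_j is NaN), which points at whatever happens to hidden unit i between wx_b and the loss.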

Not sure how much help this can be, but sometimes I found that NaN gradients came from very large losses.


Hi, thanks Simone. I am actually finding that the magnitudes of the losses are quite high in comparison with other models I have trained (now I have losses on the order of 10, instead of on the order of 0.1 in the first moments of training). I have however tried renormalizing the loss function by dividing its value by magnitudes as large as 1e6, and the NaNs keep appearing.

I see. Another situation where I got NaN gradients was when using Adam instead of SGD, but you already tried it.

What about increasing Adam’s eps parameter?
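
For reference, a minimal sketch of how `eps` can be passed to Adam, assuming a recent PyTorch release. The `Linear` model is a hypothetical stand-in for the RBM parameters, and the value 1e-4 is an arbitrary illustration, not a recommendation from this thread.

```python
import torch

model = torch.nn.Linear(4, 2)  # hypothetical stand-in for the RBM parameters

# eps is added to the denominator of Adam's update; the default is 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-2, eps=1e-4)
```

A larger `eps` guards against dividing by a near-zero second-moment estimate; it cannot repair a gradient that is already NaN when it reaches the optimizer.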

I have not tried increasing Adam’s eps parameter, I’ll try that. But from what I see, is it more a problem of the optimizer than of the dataset or PyTorch itself?

Well, I assume there’s no NaN in the dataset :wink: I don’t know about this being a Pytorch bug… Maybe you could snapshot the state of the model and the input data just before the gradients go NaN, and see if some particular numbers cause the problem for some reason.


Yes, you’re right, that was a stupid question. Of course the dataset is fine :sweat_smile:. I have captured that data, but nothing seems unreasonable, at least from just visual inspection. Is there a way to, just given the previous data, compute the gradient? (so I can inspect individually each step that is taken).

Also, something that is bugging me is that, when I do contrastive divergence during several steps, I am not sure whether the computation tree that does the backpropagation is extended. Is there a way of visualizing this?
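
A rough way to check this without a plotting tool, assuming a recent PyTorch release where tensors expose `grad_fn` and each node exposes `next_functions`. The helper `count_graph_nodes` and the toy tensors are hypothetical; the idea is to count the autograd nodes reachable from the loss and see whether the count grows with the number of steps.

```python
import torch

def count_graph_nodes(output):
    """Hypothetical helper: count autograd nodes reachable from `output`."""
    seen = set()
    stack = [output.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # next_functions holds (node, input_index) pairs feeding this node
        stack.extend(parent for parent, _ in fn.next_functions)
    return len(seen)

w = torch.ones(3, requires_grad=True)
one_step = (w * 2).sum()
two_steps = ((w * 2) * 2).sum()
detached = (w * 2).detach().sum()  # cut off from the graph

print(count_graph_nodes(one_step), count_graph_nodes(two_steps),
      count_graph_nodes(detached))
```

If the count for the real loss increases with the number of contrastive divergence steps, the graph is being extended through the sampling chain; a count that stays constant suggests the chain is cut off from the graph somewhere.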

Also I am tempted to substitute the optimizer’s update line with a manual parameter update:

for p in rbm.parameters():
    p.data.add_(-learning_rate, p.grad.data)

This seems to delay the appearance of the error, but it does not prevent it from happening. I’m also trying to see now if the nature of the error is the same.

I think I have arrived at something useful, and it has to do with my attempt to ‘fix’ the function free_energy(). The problem is that, when ‘wx_b’ overflows, the computation of hidden_term involves a non-continuous (and hence non-differentiable) transformation: the substitution of the cell in hidden_term that evaluates to inf by the corresponding value in wx_b. In an attempt to solve it, I have turned to the log-sum-exp trick, which gives

def free_energy(self,v):
    vbias_term = v.mv(self.v_bias)
    wx_b = F.linear(v,self.W,self.h_bias)
    zr = Variable(torch.zeros(wx_b.size()))
    mask = torch.max(zr, wx_b)
    hidden_term = (((wx_b - mask).exp() + (-mask).exp()).log() + (mask)).sum(1)
    return (-hidden_term - vbias_term).mean()

This code, for now, is giving a finite value for a gradient that was NaN in an explicit example that I have been able to isolate. I will report if any other problems appear.
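
For readers following along, the rewrite above rests on this identity, written with x for an entry of wx_b and m for the corresponding entry of the tensor the code calls mask (the symbols x and m are introduced here for illustration):

```latex
\log\left(1 + e^{x}\right)
  = \log\left(e^{m}\left(e^{-m} + e^{x-m}\right)\right)
  = m + \log\left(e^{x-m} + e^{-m}\right),
\qquad m = \max(0, x).
```

With this choice of m, both exponents x - m and -m are at most 0 and one of them is exactly 0, so neither exponential can overflow and the argument of the logarithm stays between 1 and 2.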

EDIT: This indeed does the job. I have trained the model for 50 epochs, whereas before the NaNs would appear no later than the 20th epoch. Thank you all for the tips!
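
A small check of the formulations discussed in this thread, assuming a recent PyTorch release (`torch.where`, `torch.isinf`, `torch.clamp` and `torch.nn.functional.softplus` rather than the 0.2-era API). The function names are hypothetical, `torch.where` stands in for the in-place masked assignment of the first attempt, and `F.softplus` is included only as a built-in point of comparison that was not used in the thread.

```python
import torch
import torch.nn.functional as F

def naive(x):
    # log(1 + exp(x)); exp overflows to inf for large x in float32
    return x.exp().add(1).log()

def patched(x):
    # First attempt from the thread: replace inf entries by x after the fact
    h = naive(x)
    return torch.where(torch.isinf(h), x, h)

def stable(x):
    # Log-sum-exp form from the thread: m + log(exp(x - m) + exp(-m)), m = max(0, x)
    m = torch.clamp(x, min=0)
    return ((x - m).exp() + (-m).exp()).log() + m

def grad_of(fn, values):
    # Gradient of sum(fn(x)) with respect to x, evaluated at `values`
    x = torch.tensor(values, requires_grad=True)
    fn(x).sum().backward()
    return x.grad

values = [0.0, 100.0]
for name, fn in [("naive", naive), ("patched", patched),
                 ("stable", stable), ("softplus", F.softplus)]:
    print(name, grad_of(fn, values))
```

In this sketch the naive and patched versions both report NaN for the large input, because the backward pass still goes through the overflowed exp (a zero upstream gradient multiplied by inf), while the log-sum-exp form and `F.softplus` give finite gradients close to 0.5 and 1.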