I’ve been training a model and I constantly run into problems during backpropagation. It turns out that after calling backward() on the loss function, there is a point at which the gradients become NaN. I am aware that in PyTorch 0.2.0 there is a known problem of the gradient at zero becoming NaN (see issue #2421 or some posts in this forum). I have therefore modified reduce.py as indicated in commit #2775 (I somehow cannot build everything from source). This means that when I run the code
```python
import torch
from torch.autograd import Variable

x = Variable(torch.zeros(1), requires_grad=True)
out = x * 3
out.backward()
x.grad
```
I get the expected gradient of 3, with no NaN.
But even with this, the problem does not go away. I have monitored the weights, and they all look fine right up until the NaN appears in the gradient (I have checked the magnitude of each weight individually). So I was wondering: is there a way to “quickly” identify the source of the NaN in the gradient, other than keeping track of every single computation in the model?
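To be concrete about what I mean by tracking things manually, here is a minimal sketch of the kind of per-parameter gradient check I could wire up with backward hooks (the name add_nan_hooks is just illustrative; this assumes an nn.Module with named_parameters() and PyTorch 0.2, where torch.isnan is not available):

```python
import torch.nn as nn

def add_nan_hooks(model):
    # Hypothetical helper: register a backward hook on every parameter
    # so that the first NaN gradient is reported with the parameter's name.
    for name, param in model.named_parameters():
        def hook(grad, name=name):
            # PyTorch 0.2 has no torch.isnan; rely on the fact that NaN != NaN.
            g = grad.data if hasattr(grad, 'data') else grad
            if (g != g).sum() > 0:
                print('NaN gradient in parameter:', name)
            return grad
        param.register_hook(hook)
```

Calling add_nan_hooks(model) once after building the model would at least tell me which parameter’s gradient goes bad first, but it still doesn’t point at the intermediate operation that produced the NaN.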
Thank you in advance.