Gradient value is NaN

Hi team,
Please see the code below:

x.requires_grad = True
loss.backward()
print(x.grad)

Output:

tensor([ 1.0545e-05,  9.5438e-06, -8.3444e-06,  …, nan, nan, nan])
How can I resolve this NaN problem? Because of it I am unable to find the range of x.grad.
Please help me resolve this issue.

Perhaps this is due to exploding gradients? I’d recommend first trying gradient clipping and seeing how the training goes.
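A minimal sketch of gradient clipping, assuming a standard training loop (the tiny model and data here are placeholders just for illustration):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# clip the global norm of all gradients to 1.0 before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()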

Thanks for the answer. Actually, I am trying to perform an adversarial attack, so I don’t have to perform any training. The strange thing is that when I calculate the gradients over the original input I get tensor([0., 0., 0., …, nan, nan, nan]) as the result, but if I make very small changes to the input the gradients turn out to be fine, in the range between tensor(-0.0501) and tensor(0.0580).
Could you help me figure out this issue?

You could add torch.autograd.set_detect_anomaly(True) at the beginning of your script to get an error with a stack trace, which should point to the operation that created the NaNs and help with debugging the issue.
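A minimal sketch of what that looks like, using a toy computation that produces a NaN:

import torch

torch.autograd.set_detect_anomaly(True)

x = torch.randn(1, requires_grad=True)
y = x / 0.    # Inf
z = y / y     # Inf / Inf -> NaN
z.backward()  # should raise a RuntimeError whose traceback points to the division above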

Hey, I am also having NaN issues with the gradients. I tried gradient clipping and converted my ReLU activations to LeakyReLU, but no progress. Any suggestions would be great. Thanks.

Assuming that the forward pass does not create invalid outputs, you could register hooks on the parameters of the model and print their gradients during the backward pass in order to isolate which gradient receives the first invalid value.
This could make it easier to debug the issue further and to check the operations used to create this gradient (e.g., are you dividing by a small number?).
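A minimal sketch of such hooks (the Sequential model here is just a placeholder):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

# print a message as soon as a parameter receives a non-finite gradient
for name, param in model.named_parameters():
    def check_grad(grad, name=name):
        if not torch.isfinite(grad).all():
            print(f"non-finite gradient in {name}")
    param.register_hook(check_grad)

out = model(torch.randn(4, 10)).mean()
out.backward()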

Hey, any idea why the gradient becomes NaN? I mean, which mathematical operation causes it?

Invalid outputs can create NaN gradients:

import torch

x = torch.randn(1, requires_grad=True)
y = x / 0.   # y is +/-Inf
y = y / y    # Inf / Inf -> NaN
y.backward()
print(x.grad)
# tensor([nan])
Yes, that is true. But my case is different. In my case, y is a valid output, but when I call y.backward(), one of the components of the gradient is NaN. It traces back to the input during backpropagation and most parameters become NaN.
I think that, since one of the components of the gradient is NaN, the partial derivative is NaN, so the relative change produces NaN. Any idea how it can produce NaN? Maybe my conclusion is wrong :face_with_hand_over_mouth:

Use torch.autograd.detect_anomaly to check which layer is creating the invalid gradients, then check its operations and inputs.
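torch.autograd.detect_anomaly can also be used as a context manager, so only the section you are debugging pays the overhead; a minimal sketch (the Linear model and random input are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.randn(4, 10)

# only the forward/backward inside this block is checked for anomalies
with torch.autograd.detect_anomaly():
    loss = model(x).mean()
    loss.backward()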

I’m rereading this part and am unsure how to understand it. Are you already seeing invalid values in y before calling backward? Or do you see the first invalid gradient somewhere later in the model?

Also, once you’ve narrowed down the layer or parameter where the first NaN is created, check whether something could overflow there (an Inf can then turn into a NaN in a subsequent operation).

Sorry, I had only checked for NaNs using torch.isnan(tensor).any(). Actually, I also have to take torch.isinf(tensor).any() into account. Before the loss becomes NaN, there is actually a float('infinity') in the model output:

for images, targets in dataloader['train']:
    images, targets = images.to(device), targets.to(device)
    outputs = model(images)                 # some elements are Inf
    loss = cross_entropy(outputs, targets)  # loss is NaN
    ...

Simple test:

import torch
import torch.nn as nn

input_ = torch.tensor([[1., float('infinity'), 6.],
                       [3., 5., 2.],
                       [10., 12., 4.]])
criterion = nn.CrossEntropyLoss()
loss = criterion(input_, torch.tensor([0, 2, 1]))  # loss is NaN

If your training diverges and the output overflows, the Inf values can easily become NaN values, so you could also check that the tensors are valid via torch.isfinite.
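For example, on a small made-up tensor, torch.isfinite covers both checks at once:

import torch

outputs = torch.tensor([1.0, float('inf'), float('nan')])
print(torch.isnan(outputs).any())     # tensor(True)
print(torch.isinf(outputs).any())     # tensor(True)
print(torch.isfinite(outputs).all())  # tensor(False) -> the tensor contains invalid values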
