clip_grad_norm_() returns nan

Hi everyone

I’m training a model using torch and the clip_grad_norm_ function is returning a tensor with nan:
tensor(nan, device='cuda:0')
Is there any specific reason why this would happen? Thanks for the help.

Hi,

This might happen if the norm of your Tensors is 0? Or if any Tensor has a single element?

Please excuse my late response. The tensor has more than one element, but I did notice that the elements in the tensor are very close to zero. Could this also cause the norm to be nan? And how would I get around it? Thanks

@albanD can correct me if I’m wrong, but clip_grad_norm_ is an in-place operation and doesn’t return anything (None), which might be implicitly cast to nan. So use it like this (and do not assign the result to anything):

clip_grad_norm_(model.parameters(), 1.0)

I’m not sure about that; from the doc it does modify the gradients in place, but it also returns the total norm.
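For context, the call usually sits between backward() and step() in the training loop. A minimal sketch (the model, optimizer, and data here are made up for illustration):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Clips gradients in place AND returns the pre-clipping total norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
opt.step()
print(total_norm)
```

If the gradients are healthy, total_norm is an ordinary finite tensor; a nan here means something upstream already produced nan gradients.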

@Maks_Botlhale Which norm are you using? This is most likely due to the content of your weights yes :confused:

Hi

I’m using norm_type=2. Yes, the clip_grad_norm_(model.parameters(), 1.0) function does return the total_norm and it’s this total norm that’s nan.

Is any element in any parameter nan (or inf) by any chance? You can use p.isinf().any() or p.isnan().any() to check.

I just checked for that; none of the elements in the parameters are infinite. See the screenshot below. I tried decreasing the learning rate and that didn’t help; some people suggested changing the dropout rate, but that didn’t help either. I also noticed that the validation loss is nan.

This is surprising…
The clip_grad_norm_ function is pretty simple and is there: https://github.com/pytorch/pytorch/blob/1c6ace87d127f45502e491b6a15886ab66975a92/torch/nn/utils/clip_grad.py#L25-L41
Can you try to copy paste that in your code and check it gives nan as well? Then you can add some prints there to see when the nan appears :slight_smile:

I copied and pasted that as suggested, and I am still getting nan values when it’s calculating the total norm. Line 36 of the code I copied calculates the total norm as:
total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)

I ran p.grad.detach() on a separate line and noticed that’s where the nan values start popping up.
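One way to narrow this down is to print each parameter’s gradient norm individually, so you can see which layer introduces the nan. A sketch (the layer and the planted nan below are made up; in practice you’d pass your own model right after backward()):

```python
import torch

def report_grad_norms(model, norm_type=2.0):
    # Print every parameter's gradient norm so the nan source is visible.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        print(name, torch.norm(g, norm_type).item(),
              "has_nan:", bool(g.isnan().any()))

# Tiny demo with a nan planted in one gradient:
m = torch.nn.Linear(2, 2)
m.weight.grad = torch.tensor([[0.1, float("nan")], [0.2, 0.3]])
m.bias.grad = torch.tensor([0.0, 1.0])
report_grad_norms(m)
```

Any single nan entry makes that parameter’s norm nan, which then makes the stacked total norm nan as well.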

Oh right (sorry, I missed that…). It computes the norm of the gradients, not of the Tensors themselves!
You need to check whether the gradients of the parameters contain nans or infs: p.grad.isnan().any() or p.grad.isinf().any()

Yes, that function also returns False. See the screenshot below.

Well, if p.grad.detach() has nans, as you said in your comment above, then the grads must already contain nans.

You can try this quite simple example; maybe it helps you find the issue:

import torch
x = torch.tensor([1., 2.])
x.grad = torch.tensor([0.4, float("inf")])  # plant an inf in the gradient
torch.nn.utils.clip_grad_norm_(x, 5)
print(x.grad)  # the inf entry makes the total norm inf, so clipping produces nan