What to do for non-finite warning in `clip_grad_norm`?

I started to see this warning for a language model training

FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.

Is this an indicator that my model is not working well? And if so, is there any recommendation on what to change? Thanks!

(I am using Adam with weight decay)

This warning indicates that some of the calculated gradients are non-finite (most likely Inf or NaN). I would claim it depends on your use case whether these invalid gradients are expected and clipping them is fine, or whether you would like to avoid them in the first place.

However, in the case of Inf, clipping by the norm means that all non-Inf entries will be zeroed out, unless PyTorch handles this case specifically.
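For illustration, here is a minimal sketch (with an artificially constructed Inf gradient on a single parameter) of what clipping does when the total norm is non-finite:

import torch

# Build a single parameter whose gradient contains an Inf entry
p = torch.zeros(3, requires_grad=True)
p.grad = torch.tensor([1.0, 2.0, float("inf")])

# The total norm is Inf, so the clip coefficient max_norm / total_norm is 0,
# the finite gradient entries get scaled to 0, and the Inf entry itself becomes NaN
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0, error_if_nonfinite=False)
print(total_norm)  # tensor(inf)
print(p.grad)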

@ptrblck @SimonW I am using BERT/large transformers, and this happens in the middle of training. Any insights based on this? Should I increase/decrease the learning rate / max_clip_norm / warmup steps, etc.?

Maybe your model diverged… try using a smaller learning rate or an LR scheduler and see if the gradients keep diverging.
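A minimal sketch of that suggestion, assuming AdamW plus a linear warmup via LambdaLR (the model, peak learning rate, and warmup length are placeholders, not values from this thread):

import torch

model = torch.nn.Linear(10, 10)  # placeholder for your transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

warmup_steps = 1000
def lr_lambda(step):
    # Ramp the learning rate linearly from ~0 to its peak over warmup_steps updates
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop, after optimizer.step():
# scheduler.step()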


I also get this warning after updating PyTorch to 1.9.


Hi pal, could I ask if you still have this issue? Any hints on how to solve it? I ran into this problem recently and am kind of stuck. I am wondering whether it is caused by vanishing gradients and am trying to solve it by adding layer normalization. I am still monitoring the run to see if it goes well.

Hi, I am also encountering the same warning:


clip_grad.py", line 50, in clip_grad_n
orm_
    f'The total norm of order {norm_type} for gradients from '
RuntimeError: The total norm of order 2.0 for gradients from `parameters` is non-finite, so it cannot be clipped. To disable this
 error and scale the gradients by the non-finite norm anyway, set `error_if_nonfinite=False`

On debugging I found out that, at first, many of the gradients become NaN, which then leads to NaN in the predictions and the loss in the following steps.

I tried reducing the learning rate; it sometimes works, but not always.

This has happened after I modified my loss function.
At first the loss varied in [0, 1000]; the new loss varies in [{some_minimum}, 10k]. I want to understand what the reason behind the NaN values is, why it is NaN and not Inf if there is an exploding gradient problem, and what else I could work on if I don't want to modify the loss function [currently there is gradient clipping {torch.nn.utils.clip_grad_norm_(self.model.parameters(), .25, error_if_nonfinite=True)} and an lr_scheduler].

@ptrblck @nitaifingerhut

Operations performed on an inf value might create NaNs if the output is undefined, e.g. as seen here:

import torch

x = torch.tensor(float("inf"))
y = x - x  # inf - inf is undefined and evaluates to NaN
print(y)
# tensor(nan)

Based on your description your gradients might indeed overflow as your loss increased, so you might want to check which operation creates the invalid gradients, e.g. via the anomaly detection util.
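A minimal sketch of enabling it (the model and input are placeholders for your own forward pass):

import torch

model = torch.nn.Linear(4, 1)  # placeholder model
x = torch.randn(8, 4)          # placeholder input

# Anomaly detection makes autograd report the forward operation that
# produced the first non-finite gradient during backward()
with torch.autograd.set_detect_anomaly(True):
    loss = model(x).sum()
    loss.backward()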


Below is the output with torch.autograd.set_detect_anomaly(True) enabled:

[W python_anomaly_mode.cpp:104] Warning: Error detected in IndexBackward. Traceback of forward call that caused the error:
  File "/home/experiment.py", line 481, in <module>
    experiment('gym-experiment', variant=vars(args))
  File "/home/experiment.py", line 438, in experiment
    outputs = trainer.train_iteration(num_steps=variant['num_steps_per_iter'], iter_num=iter+1, print_logs=True)
  File "/home/training/trainer.py", line 31, in train_iteration
    train_loss = self.train_step()
  File "/home/training/seq_trainer.py", line 24, in train_step
    action_preds = action_preds.reshape(-1, act_dim)[attention_mask.reshape(-1) > 0]
 (function _print_stack)
 28%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                                                              | 2819/10000 [07:29<19:03,  6.28it/s]
Traceback (most recent call last):
  File "/home/experiment.py", line 481, in <module>
    experiment('gym-experiment', variant=vars(args))
  File "/home/experiment.py", line 438, in experiment
    outputs = trainer.train_iteration(num_steps=variant['num_steps_per_iter'], iter_num=iter+1, print_logs=True)
  File "/home/training/trainer.py", line 31, in train_iteration
    train_loss = self.train_step()
  File "/home/training/seq_trainer.py", line 119, in train_step
    loss.backward()
  File "/home/anaconda3/envs/project_3_7/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/anaconda3/envs/project_3_7/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'IndexBackward' returned nan values in its 0th output.

But I am not able to infer anything from here; action_preds only becomes NaN once the parameters have already become NaN.
On further debugging I found that the first layer to encounter a NaN value is
transformer.h0.ln_1.weight,
which stands for (ln_1): LayerNorm((128,), eps=1e-05, elementwise_affine=True) in my model configuration. The base model I am using is GPT2.
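A sketch of the kind of check that can locate this after backward() (the model and loss below are placeholders, not the actual GPT2 setup):

import torch

def find_nonfinite(model):
    # Report any parameter or gradient that contains NaN/Inf after backward()
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite values in parameter {name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite values in gradient of {name}")

model = torch.nn.Linear(4, 2)          # placeholder model
loss = model(torch.randn(8, 4)).sum()  # placeholder loss
loss.backward()
find_nonfinite(model)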

Can you suggest, based on these insights, what might be wrong and how I can keep using this loss function without running into this?

Resolved, Thanks!

Root cause:
In my loss function, to prevent division by zero, I defined epsilon = 1e-8; changing this value to 1e-4 resolved the NaN values in the grads.

This is probably due to the default float32 dtype of the model configs, right?

(Note: this value was used in a norm and would probably be squared in the division, causing an overflow. The only thing I don't understand is that, while printing the loss, it never reached such a high value.)
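A minimal sketch of the effect, with a hypothetical loss term rather than the actual one: a denominator that is only stabilized by a tiny eps produces gradients on the order of 1/eps even while the loss value itself stays moderate; whether those gradients then overflow to Inf depends on what happens downstream (further squaring, propagation through the network, precision).

import torch

for eps in (1e-8, 1e-4):
    pred = torch.zeros(1, requires_grad=True)
    denom = torch.zeros(1)  # stands in for a norm term that happens to be ~0
    loss = (pred / (denom + eps)).sum()
    loss.backward()
    # The loss is 0 in both cases, but the gradient magnitude is ~1/eps: ~1e8 vs ~1e4
    print(eps, loss.item(), pred.grad)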