I started to see this warning while training a language model:
FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
Is this an indicator that my model is not working well? And if so, is there any recommendation on what to change? Thanks!
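For reference, the warning comes from the torch.nn.utils.clip_grad_norm_ call in the training loop. Below is a minimal, self-contained sketch of the kind of step where it shows up (dummy model and data as placeholders, not my actual setup):

```python
import torch
import torch.nn as nn

# Dummy stand-in for the real language model (placeholder, not the actual code).
model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()
loss.backward()

# This is the call that emits the FutureWarning when the total gradient norm
# is non-finite (Inf/NaN); it also returns that total norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(total_norm)
```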
This warning indicates that some of the calculated gradients are non-finite (most likely Inf or NaN). Whether that is acceptable depends on your use case: are these invalid gradients expected, and is clipping them fine, or would you rather avoid them in the first place?
However, if the total norm is Inf, clipping by norm means the finite entries will effectively be removed (i.e., scaled to zero), unless PyTorch does something special for this case.
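To illustrate, here is a toy sketch (not taken from the original question) that injects an Inf into one gradient and shows what clip_grad_norm_ does with it, assuming the usual scaling behavior of max_norm / total_norm:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
model(torch.randn(2, 4)).sum().backward()

# Simulate a non-finite gradient by corrupting one entry.
model.weight.grad[0, 0] = float('inf')

# Check which parameters contain non-finite gradients before clipping.
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite gradient in {name}")

# The total norm is Inf, so the scale factor max_norm / total_norm is ~0:
# finite entries are scaled to 0 and the Inf entry becomes NaN (Inf * 0).
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)
print(model.weight.grad)
```

If you would rather fail fast, passing error_if_nonfinite=True to clip_grad_norm_ raises an error instead of silently scaling the gradients.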
@ptrblck @SimonW I am using BERT/large transformers, and this happens in the middle of training. Any insights based on this? Should I increase/decrease the learning rate, max_clip_norm, warmup steps, etc.?
Hi, may I ask whether you still have this issue, and whether you found any hints to solve it? I ran into this problem lately and am kind of stuck. I am wondering whether it is caused by vanishing gradients and am trying to address it by adding layer normalization. I am still monitoring the training process to see if it goes well.