How to fix NaN in the BERT layer?

Hello everyone,

I am training a custom BERT model and observe the following problems.

These points only apply to the AMP (mixed-precision) case; training was fine in float32.

  1. The attention mask starts to include -inf values, and dropout turns -inf into NaN (the dropout mask multiplies -inf by zero):

import numpy as np
import torch
from torch import nn

drop = nn.Dropout()
drop(torch.tensor([-np.inf, 1], dtype=torch.float))  # the -inf element becomes NaN when its dropout mask is 0
  2. If the attention scores contain -inf, a matmul between -inf and zero returns NaN.
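The second point can be reproduced in isolation (a minimal sketch; under IEEE-754 arithmetic, -inf multiplied by an exact zero is NaN):

```python
import torch

a = torch.tensor([[float('-inf'), 0.0]])
b = torch.tensor([[0.0], [1.0]])
out = torch.matmul(a, b)  # (-inf * 0) + (0 * 1) -> NaN
print(out)
```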

How do I deal with those -inf values? Simply excluding or replacing them is not an option, since the masking itself is not a math operation I can modify.

Is it correct to use torch.nan_to_num() after each operation?

Generally, when a training step produces NaN or Inf values, it is not possible to “recover” within that step; a common practice is to simply reject or skip the weight update for that step to avoid propagating the issue to the model weights (so nan_to_num wouldn’t really help). However, if the training is consistently producing NaN values, the training pipeline may need further tuning (e.g., gradient scaling, etc.).
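A minimal sketch of this skip-the-step practice (the helper name is illustrative; note that `torch.cuda.amp.GradScaler.step()` already does something similar by skipping `optimizer.step()` when the unscaled gradients contain inf/NaN):

```python
import torch

def safe_step(model, optimizer, loss):
    """Skip the weight update entirely if the loss is non-finite (illustrative helper)."""
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return False  # step rejected, weights untouched
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True  # step applied
```

With AMP, the scaler handles inf/NaN gradients for you, so an explicit check like this is usually only needed for the loss value itself.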


The -inf values should never be multiplied with anything directly; they should only go into softmax. Attention dropout is typically applied to the softmax output (at least in BERT/GPT), so dropout only ever sees finite probabilities. Applying dropout directly to the masked scores, as above, is very nonstandard.
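A minimal sketch of that standard ordering (shapes and mask are illustrative): the -inf entries go only into softmax, which maps them to exactly zero probability, and dropout then operates on finite values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 4, 4)                       # raw attention scores
mask = torch.tensor([True, True, False, False])     # False = padding position
scores = scores.masked_fill(~mask, float('-inf'))   # mask BEFORE softmax

probs = F.softmax(scores, dim=-1)                   # -inf -> exactly 0 probability
probs = F.dropout(probs, p=0.1, training=True)      # dropout sees only finite values
```

Because 0 * anything stays 0, the masked positions remain exactly zero after dropout, and no NaN is produced.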

Best regards