I am training a custom BERT model and am running into the following problems.
These issues only show up with AMP (mixed precision); training in float32 was fine.
- The attention mask contains -inf values, and dropout turns those -inf entries into NaN (when dropout zeroes an element, the multiplication 0 * -inf gives NaN):
```python
import numpy as np
import torch
import torch.nn as nn

drop = nn.Dropout()  # p=0.5; a fresh module is in training mode by default
drop(torch.tensor([-np.inf, 1], dtype=torch.float))
# whenever the -inf entry is dropped, it is multiplied by 0 and 0 * -inf = NaN
```
- If the attention tensor contains -inf, the matmul produces NaN, since -inf multiplied by zero is NaN (see the snippet after this list).
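To illustrate the matmul point, here is a minimal reproduction (the shapes and values are made up for illustration, not taken from my model):

```python
import torch

scores = torch.tensor([[float("-inf"), 1.0]])  # one masked score, one real score
values = torch.tensor([[0.0], [2.0]])

print(scores @ values)  # tensor([[nan]]), because -inf * 0 = nan
```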
How do I deal with these -inf values? Simply excluding or replacing them after the fact doesn't seem possible, because the masking isn't a single math operation whose output I can just patch.
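One workaround I'm considering is to build the additive mask with the most negative finite value of the compute dtype instead of -inf, so that neither dropout nor the matmul ever sees an actual infinity. A rough sketch of the idea (`additive_mask` and its arguments are names I made up for illustration):

```python
import torch

def additive_mask(pad_positions: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Fill masked positions with the dtype's most negative finite value
    # rather than -inf; 0 * min_value stays finite, so no NaN can appear.
    min_value = torch.finfo(dtype).min
    return torch.zeros(pad_positions.shape, dtype=dtype).masked_fill(pad_positions, min_value)

print(additive_mask(torch.tensor([True, False]), torch.float16))
# tensor([-65504., 0.], dtype=torch.float16)
```

Would that be correct under AMP, or is there a better-supported approach?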