Hi, I ran into the same problem and found the cause.
The dropout layer doesn't produce the NaN values.
There is a softmax layer right before the dropout layer, and the softmax is what produces the NaNs.
https://github.com/huggingface/transformers/blob/972fdcc77878cf7afcc8aef8979d6b4241005bb6/src/transformers/models/bert/modeling_bert.py#L355
Does anyone have a suggestion for fixing the numerical instability of the softmax layer?
I'm implementing mixed precision training, and I found that softmax is not stable with fp16 inputs.
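For reference, here is a minimal sketch (in NumPy, just for illustration) of why softmax overflows in fp16: fp16 maxes out at 65504, so `exp(x)` becomes `inf` once a logit exceeds roughly 11, and `inf / inf` yields NaN. Subtracting the row max before exponentiating keeps every exponent at or below zero:

```python
import numpy as np

def naive_softmax(x):
    # exp overflows fp16 once x exceeds ~11, since exp(11.1) > 65504 (fp16 max)
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # subtracting the max keeps every exponent <= 0, so exp stays in [0, 1]
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([12.0, 1.0], dtype=np.float16)
print(naive_softmax(logits))   # exp(12) overflows to inf, and inf/inf gives nan
print(stable_softmax(logits))  # finite probabilities that sum to ~1
```

Framework softmax implementations (including PyTorch's) already apply this max-subtraction trick internally, so if NaNs still appear in practice the overflow is likely happening upstream of the softmax (e.g., in the fp16 attention scores themselves); a common workaround in mixed precision training is to cast the scores to fp32 just before the softmax and cast back afterwards.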
Thanks!