Getting NaNs from dropout layer

Hi, I ran into the same problem and found the cause.

The dropout layer itself doesn’t cause the NaN values.

There is a softmax layer right before the dropout layer, and it is the softmax that produces the NaNs:
https://github.com/huggingface/transformers/blob/972fdcc77878cf7afcc8aef8979d6b4241005bb6/src/transformers/models/bert/modeling_bert.py#L355

Does anyone have a suggestion for solving the numerical instability of the softmax layer?

I’m implementing mixed-precision training, and I found that the softmax layer is not stable with fp16 inputs.
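
For reference, here is a minimal sketch of how I understand the failure (plain PyTorch, not the actual BertSelfAttention code, and the numbers are made up just to trigger the overflow): the attention scores overflow the fp16 range, the row then contains inf, and a softmax over a row containing inf is NaN everywhere, which dropout simply passes through.

```python
import torch

# fp16 can only represent values up to 65504, so large attention scores
# (QK^T / sqrt(d) + mask) overflow to inf. A softmax over a row containing
# inf is NaN for every entry, because exp(inf - inf) is undefined, and the
# following dropout layer just passes the NaNs through.
raw_scores = torch.tensor([[1.0, 2.0, 7.0e4]])      # fp32 "attention scores"

scores_fp16 = raw_scores.half()
print(scores_fp16)                                   # tensor([[1., 2., inf]], dtype=torch.float16)
print(torch.softmax(scores_fp16.float(), dim=-1))    # all NaN: upcasting after the overflow is too late

# Possible workaround: keep the score computation and the softmax in fp32,
# and only cast the resulting probabilities back to fp16 afterwards.
probs = torch.softmax(raw_scores, dim=-1).half()
print(probs)                                         # tensor([[0., 0., 1.]], dtype=torch.float16)
```

Note that once inf has already appeared in the fp16 scores, upcasting just the softmax input doesn’t help, so the scores themselves would need to stay in a wider dtype up to the softmax. I’d be glad to hear if there is a better-supported way to do this with the Hugging Face implementation.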

Thanks!