I am training a custom BERT model and am running into the following problems.
These issues only show up with AMP (mixed precision); training in float32 was fine.
- The attention mask contains -inf values, and dropout turns those -inf entries into NaN (when dropout zeroes an element, the multiplication 0 * -inf gives NaN):
```python
import numpy as np
import torch
import torch.nn as nn

drop = nn.Dropout()  # p=0.5; a fresh module is in training mode by default
drop(torch.tensor([-np.inf, 1], dtype=torch.float))
# whenever the -inf entry is dropped, it is multiplied by 0 and 0 * -inf = NaN
```
- If the attention tensor contains -inf, the matmul produces NaN, since -inf multiplied by zero is NaN (see the snippet after this list).
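To illustrate the matmul point, here is a minimal reproduction (the shapes and values are made up for illustration, not taken from my model):

```python
import torch

scores = torch.tensor([[float("-inf"), 1.0]])  # one masked score, one real score
values = torch.tensor([[0.0], [2.0]])

print(scores @ values)  # tensor([[nan]]), because -inf * 0 = nan
```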
How do I deal with these -inf values? Simply excluding or replacing them after the fact doesn't seem possible, because the masking isn't a single math operation whose output I can just patch.
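One workaround I'm considering is to build the additive mask with the most negative finite value of the compute dtype instead of -inf, so that neither dropout nor the matmul ever sees an actual infinity. A rough sketch of the idea (`additive_mask` and its arguments are names I made up for illustration):

```python
import torch

def additive_mask(pad_positions: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Fill masked positions with the dtype's most negative finite value
    # rather than -inf; 0 * min_value stays finite, so no NaN can appear.
    min_value = torch.finfo(dtype).min
    return torch.zeros(pad_positions.shape, dtype=dtype).masked_fill(pad_positions, min_value)

print(additive_mask(torch.tensor([True, False]), torch.float16))
# tensor([-65504., 0.], dtype=torch.float16)
```

Would that be correct under AMP, or is there a better-supported approach?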