I have the same problem with transformers. See "Transformer model doesn't improve even when fed the same single example over and over".
Did you ever figure out a solution to your problem?