I have a transformer model with 10% dropout at the positional encoding and 20% dropout in both the encoder and decoder layers.
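For reference, my setup looks roughly like this (a minimal sketch using the standard sinusoidal encoding from the PyTorch tutorial; `d_model`, `nhead`, and `max_len` are illustrative, not my exact values):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding followed by dropout (the 10% in question)."""
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)

pos_enc = PositionalEncoding(d_model=512, dropout=0.1)     # 10% at the positional encoding
model = nn.Transformer(d_model=512, nhead=8, dropout=0.2,  # 20% in every encoder/decoder layer
                       batch_first=True)
```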
With these settings it works well in both the training and test phases.
When I remove the dropout at the positional encoding layer, or increase it to 15%, training still goes well, but after 60 epochs or so the encoder starts producing NaN values at test time, while training continues to behave normally.
My output has a sequence length of 1, so I use a tgt mask of shape [1, 1] and a src mask of shape [1, sequence_length], with no padding mask (see the sketch below).
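To make the mask shapes concrete, here is a minimal sketch (`seq_len` and the all-zero additive masks are illustrative; zeros mean no position is masked out):

```python
import torch

seq_len = 32  # illustrative; stands in for my actual source length

# Target length is 1, so the (trivial) causal target mask is [1, 1]
tgt_mask = torch.zeros(1, 1)

# Mask over the source positions seen by the length-1 target: [1, seq_len]
src_mask = torch.zeros(1, seq_len)  # all zeros: nothing is masked

# No key padding masks are passed anywhere
```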
Any idea what could cause NaN values after 60 epochs, and how it might be related to the positional encoding dropout?