After around 60 epochs, the encoder starts producing NaN values in a transformer model

Hi everyone,
I have a transformer model with 10% dropout at the positional encoding and 20% dropout in both the encoder and decoder layers.
It works well with this setting in both the training and test sections.
When I remove the dropout at the positional encoding layer or increase it to 15%, training still works well, but after 60 epochs or so the encoder starts producing NaN values in the test section, while training continues to work fine.

I have an output with a sequence length of 1, and thus a tgt mask of shape [1, 1] and a src mask of shape [1, sequence_length], with no padding mask.
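For context, here is a minimal sketch of the setup I am describing; the dimensions, the sinusoidal positional encoding, and the variable names below are placeholders, not my exact code:

```python
import math
import torch
import torch.nn as nn

# Sketch only: d_model, nhead, and the sinusoidal encoding are assumptions.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):              # x: [seq_len, batch, d_model]
        x = x + self.pe[: x.size(0)]
        return self.dropout(x)         # the 10% dropout I change or remove

d_model, src_len = 64, 20                                      # assumed sizes
pos_enc = PositionalEncoding(d_model, dropout=0.1)             # 10% at the positional encoding
model = nn.Transformer(d_model=d_model, nhead=4, dropout=0.2)  # 20% in encoder/decoder layers

# Mask shapes as described above; no padding masks are used.
tgt_mask = torch.zeros(1, 1)         # [1, 1] for the length-1 target
src_mask = torch.zeros(1, src_len)   # [1, sequence_length]
```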

Any idea what could cause NaN values after 60 epochs, or how it relates to the positional encoding dropout?

Thanks

Maybe you could try using anomaly detection to raise an error when the first invalid gradient is created, which might help isolate which layer failed first.
You could use torch.autograd.detect_anomaly and run the forward/backward pass in this context manager.
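Something like this, where the tiny model and random data are just placeholders for your own training objects:

```python
import torch
import torch.nn as nn

# Placeholder model/data just to show the pattern.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
data, target = torch.randn(4, 8), torch.randn(4, 1)

# Any op that creates an invalid value in the backward pass will raise an error here,
# with a traceback pointing at the forward op that produced it.
with torch.autograd.detect_anomaly():
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
```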

@ptrblck
Thanks.
I tried this, but as I said, the training section works just fine and it is only in the test section that I see NaN results. Even in the test section, the first 60 epochs work well, and after that it starts to give NaN results for the loss.
Anyway, I ran it and it did not pick up anything.
Do you have any idea what could be the source of such an error?
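One thing that might narrow it down: detect_anomaly only watches the backward pass, so it will not flag NaNs that first appear in a forward pass while the model is in eval() mode. A rough sketch (the toy transformer and sizes below are placeholders for your own model and a test batch) would be to register forward hooks and print the first module whose output contains NaNs:

```python
import torch
import torch.nn as nn

def register_nan_hooks(model):
    # Attach a forward hook to every submodule that reports NaNs in its output.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of: {name}")
        return hook
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# Toy usage; replace with your own model and a batch from the test section.
model = nn.Transformer(d_model=16, nhead=2, num_encoder_layers=1, num_decoder_layers=1)
handles = register_nan_hooks(model)
model.eval()
with torch.no_grad():
    src, tgt = torch.randn(5, 2, 16), torch.randn(1, 2, 16)
    out = model(src, tgt)
for h in handles:
    h.remove()
```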