RuntimeError: Function 'MulBackward0' returned nan values in its 0th output

Thanks for the quick response, @albanD.

Some background:
1) We are using the PyTorch-based mmdetection framework: Faster R-CNN with FPN and a ResNet-50 (res50) backbone.
2) The problem is that NaN may occur when training for many more epochs. We are sure the dataset is fine, and there is no NaN issue with the TensorFlow-based counterpart. It is not easy to reproduce, even though we have already set the determinism config (Faster R-CNN training examples are still randomly selected in each iteration).

We print the key tensors in each iteration. At some point the values become bigger and bigger and finally turn into NaN: (sorry, only 1 file is allowed to upload)

The tensors in the red box are the res50 output feature maps. However, at this point detect_anomaly did not report any error and the forward pass kept going; the exception was not triggered until the backward phase.

My questions:
1) Why can the forward computation continue when NaN has already occurred?
2) Any suggestions for debugging this issue?
Thanks.