You could add `torch.autograd.set_detect_anomaly(True)` at the beginning of your script. With anomaly detection enabled, the backward pass raises an error with a stack trace pointing to the operation that created the NaNs, which should help you debug the issue.
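Here is a minimal sketch of what that looks like. The toy `sqrt` example and the `find_nan_source` helper are hypothetical, chosen only because they produce a NaN in the backward pass while the forward pass looks fine. The gradient of `sqrt` at 0 involves a division by zero, and multiplying by 0 upstream turns it into `0 / 0 = nan`:

```python
import torch

# Enable anomaly detection globally. It slows autograd down noticeably,
# so only turn it on while debugging.
torch.autograd.set_detect_anomaly(True)

def find_nan_source():
    # Hypothetical toy graph whose forward pass contains no NaNs.
    x = torch.tensor(0.0, requires_grad=True)
    y = torch.sqrt(x)  # forward: sqrt(0) = 0
    z = y * 0          # upstream gradient for y becomes 0
    try:
        # SqrtBackward computes grad / (2 * sqrt(x)) = 0 / 0 = nan,
        # which anomaly mode detects and turns into a RuntimeError.
        z.backward()
    except RuntimeError as err:
        # The message names the backward function that returned NaNs;
        # a separate warning prints the traceback of the forward call.
        return str(err)
    return None

message = find_nan_source()
print(message)
```

Without anomaly detection, `backward()` would finish silently and `x.grad` would just contain `nan`. With it enabled, the error names the failing backward node, and the accompanying warning shows the line in your forward code that created it.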