Dear PyTorch community,
I’m facing a rather weird problem that I no longer know how to debug - my only guess is a hardware-related issue.
I’ve been successfully training a not-so-complex CNN (complex-yolo) on a 1080Ti while trying to improve it.
On a new PC with a 2080Ti, I got crashes in training. After a lot of debugging, I noticed my model suddenly blows up (outputting Infs and NaNs) early in training for no apparent reason - there were no issues with the input data or the targets. Reverting the code to states I know trained properly, lowering the learning rate, simplifying things: nothing helped. I ran with a fixed seed and deterministic mode to debug the issue.
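A minimal sketch of this kind of debugging setup, for anyone who wants to reproduce it. This is an illustration built on standard PyTorch calls, not the exact training code; the helper names `make_deterministic` and `add_nonfinite_hooks` are hypothetical.

```python
import random

import torch
import torch.nn as nn


def make_deterministic(seed: int = 0) -> None:
    """Fix the RNG seeds and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and the CUDA generators
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # no autotuned algorithm selection


def add_nonfinite_hooks(model: nn.Module, log: list) -> list:
    """Register forward hooks that record every module whose output has Inf/NaN.

    Module names are appended to `log` in the order the modules finish their
    forward pass, so log[0] is the earliest module with a non-finite output.
    Returns the hook handles so they can be removed afterwards.
    """
    handles = []
    for name, module in model.named_modules():
        # Bind `name` as a default argument so each hook keeps its own name.
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                log.append(name)

        handles.append(module.register_forward_hook(hook))
    return handles
```

Calling `add_nonfinite_hooks(model, bad_layers)` before a training step and inspecting `bad_layers[0]` afterwards narrows the blow-up down to one layer; call `.remove()` on each returned handle once done.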
In desperation, I swapped out the 2080Ti for the old 1080Ti, ran the code without any other changes, and could train successfully. I’ve had no other issues with the 2080Ti. I can do inference on trained models without issues, and I can resume training.
I’m not even sure what info is helpful for debugging this, or where to start. I used the exact same system with both cards:
- Nvidia Driver Version: 515.43.04
- System CUDA: 11.4
- Python Environment:
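A sketch of how the version details for a report like this could be gathered from inside Python, so they can be compared between the two cards. The function name `describe_environment` is hypothetical; the `torch` attributes it reads are standard.

```python
import torch


def describe_environment() -> dict:
    """Collect the PyTorch/CUDA/cuDNN versions and the visible GPUs."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version PyTorch was built with (None on CPU builds)
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "gpus": [],
    }
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            info["gpus"].append(
                {
                    "name": torch.cuda.get_device_name(i),
                    "compute_capability": torch.cuda.get_device_capability(i),
                }
            )
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Note that the CUDA version PyTorch was built against can differ from the system-wide CUDA install, so both are worth reporting. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report.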
What can I look into to figure out whether there’s something wrong with my hardware - or whether there’s some code compatibility issue that leads to divergence in training?