Model trains on 1080Ti but quickly diverges on 2080Ti

Jacob_Lambert · June 1, 2022, 8:33am

Dear PyTorch community,

I’m facing a rather weird problem that I no longer know how to debug - my only guess is a hardware-related issue.

I’ve been successfully training a not-so-complex CNN (complex-yolo) on a 1080Ti while trying to improve it.
On a new PC with a 2080Ti, I got crashes in training. After a lot of debugging, I noticed my model suddenly blows up (outputting Infs and NaNs) early in training for no apparent reason - there were no issues with the input data or the targets. Reverting the code to states I know trained properly, lowering the learning rate, simplifying things: nothing helped. I ran with a fixed seed and deterministic mode to debug the issue.
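For reference, a minimal sketch of what "fixed seed and deterministic mode" can look like. The helper name `seed_everything` is hypothetical and not taken from the actual training code; it only uses the seeding calls and cuDNN flags that PyTorch exposes.

```python
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Seed the Python and PyTorch RNGs and request deterministic cuDNN kernels.

    Hypothetical helper for illustration; not part of the original code.
    """
    random.seed(seed)
    torch.manual_seed(seed)            # seeds the default PyTorch generator
    torch.cuda.manual_seed_all(seed)   # silently ignored if CUDA is unavailable
    torch.backends.cudnn.deterministic = True  # prefer deterministic conv algorithms
    torch.backends.cudnn.benchmark = False     # disable per-shape algorithm autotuning
```

Note that even with these settings, PyTorch does not guarantee identical results across different GPU models, so small numerical differences between the 1080Ti and the 2080Ti are expected; Infs and NaNs are not.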

In desperation I swapped out the 2080Ti for the old 1080Ti, ran the code without any other changes and could successfully train. I’ve had no other issues with the 2080Ti. I can do inference on trained models without issues, and I can resume training.

I’m not even sure what info is helpful for debugging this, or where to start. I used the exact same system with both cards:

Nvidia Driver Version: 515.43.04
System CUDA: 11.4
Python Environment:
- pytorch=1.5.0=py3.6_cuda10.2.89_cudnn7.6.5_0
- cudatoolkit=10.2.89=hfd86e86_1

What can I look into to figure out if there’s something wrong with my hardware - or if there’s some code compatibility issue that leads to divergence in training?

Thank you.

eqy · June 1, 2022, 4:49pm

This certainly seems unexpected. I would check if this is still visible on a more current version of PyTorch (e.g., >= 1.11) on the off chance that it’s a bug that has already been fixed. If it’s still visible, I would see if there’s a specific layer or part of the model where the outputs diverge substantially.
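One way to narrow this down to a layer is to register forward hooks that flag the first module producing Inf/NaN. The following is a rough sketch, not a definitive recipe; `add_nonfinite_hooks` is a hypothetical helper name, and it assumes the model is a regular `nn.Module` whose leaf modules return tensors (or tuples/lists of tensors).

```python
import torch
import torch.nn as nn


def add_nonfinite_hooks(model: nn.Module):
    """Register forward hooks that record leaf modules whose output has Inf/NaN.

    Hypothetical helper for illustration. Returns (bad_layers, handles):
    bad_layers fills up with module names in the order they execute,
    handles can be used to remove the hooks again.
    """
    bad_layers = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                # Only floating-point tensors can hold Inf/NaN.
                if (torch.is_tensor(out) and out.is_floating_point()
                        and not torch.isfinite(out).all()):
                    bad_layers.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    return bad_layers, handles
```

After a forward pass on the failing batch, `bad_layers[0]` (if present) is the earliest module in execution order whose output was non-finite. Running the same seeded batch on both cards and comparing the outputs of that module and the ones before it should show where the two GPUs start to disagree. Call `h.remove()` on each handle to detach the hooks afterwards. For the backward pass, `torch.autograd.set_detect_anomaly(True)` can similarly point at the operation that produced a NaN gradient, at the cost of slower training.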