Model performance drops significantly after updating from pytorch==1.10.1+cu111 to pytorch==1.12.1+cu113 on a Linux machine with cuda==11.4 installed

I am running PyTorch on an Ubuntu 20.04 machine with cudatoolkit==11.4 installed. Previously I had pytorch==1.10.1+cu111 installed and everything worked fine. However, yesterday I reinstalled PyTorch as pytorch==1.12.1+cu113, and model performance dropped by 10 percentage points.

There are two things to note:

  1. Not all models are affected; performance dropped for only several of them.
  2. I’m not training any new models; I only tested the existing checkpoints with the same code and input.

After reinstalling pytorch==1.10.1+cu111, everything was fine again.

Can anyone explain the reason, or point to some resources on compatibility between different CUDA versions?

Could you describe your use case in more detail and share the model definition if possible?
It would also be interesting to see the outputs of the model in eval() mode for a constant input (e.g. torch.ones), and what multiple forward passes return in both PyTorch versions.
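A minimal sketch of such a check is below. It is only an illustration: `build_model` is a hypothetical stand-in (the real model definition and checkpoint were not shared), and the function names, input shape, and number of passes are assumptions. The script prints the installed PyTorch version together with the CUDA and cuDNN versions the binary was built with, then runs several forward passes on a `torch.ones` input in eval() mode.

```python
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Hypothetical stand-in model. Replace this with the real model
    # definition and load the existing checkpoint via load_state_dict().
    torch.manual_seed(0)  # fixed seed so the random init is repeatable
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 4),
    )


def check_eval_outputs(model, input_shape=(1, 3, 32, 32), num_passes=3):
    """Run several forward passes on a constant input in eval() mode."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # eval() disables dropout and makes batchnorm use its running statistics.
    model = model.to(device).eval()
    x = torch.ones(*input_shape, device=device)
    with torch.no_grad():
        return [model(x).cpu() for _ in range(num_passes)]


if __name__ == "__main__":
    print("torch:", torch.__version__)
    print("CUDA (build):", torch.version.cuda)  # None for CPU-only builds
    print("cuDNN:", torch.backends.cudnn.version())

    outputs = check_eval_outputs(build_model())
    for i, out in enumerate(outputs):
        diff = (out - outputs[0]).abs().max().item()
        print(f"pass {i}: sum={out.sum().item():.6f}, max abs diff vs pass 0={diff:.3e}")
```

In eval() mode, repeated passes on the same input would be expected to agree closely within one environment. To compare the two PyTorch versions, one option is to store the outputs with `torch.save` in each environment and compare them afterwards with `torch.allclose`.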

Good point. However, I have already removed the torch==1.12.1+cu113 environment. I may reinstall it, but that will take a couple of days.