Hi guys, I have some code that I’ve been using for a while, but I recently updated my NVIDIA driver (I use Linux) to the 440xx series and started experiencing random freezes. Training goes well, but usually somewhere in the first epoch it hangs forever; other times it runs 8 epochs and then hangs, or it freezes the entire computer, or it throws random device-side assertion errors or CUDA errors (different runs, different errors)…
I’ve already tried rolling back to the older driver, updating PyTorch to 1.6, and using different combinations of PyTorch and CUDA (PyTorch 1.4 + CUDA 10.2 on conda, PyTorch 1.6 + CUDA 10.2 on conda, PyTorch 1.6 + CUDA 11 with Arch Linux’s Python), but the issue persists…
Could you check dmesg for reported errors/warnings that could explain the system freezes?
Also, do you only see this behavior in PyTorch, or with any CUDA program?
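For the dmesg check, a rough sketch of what to look for: NVIDIA Xid lines are the driver’s own error reports, while EDAC/MCE lines usually point at host memory or CPU problems rather than the GPU. The snippet below just demonstrates the grep pattern on two sample log lines (the sample text is made up for illustration); on a real system you would pipe `dmesg` through the same filter.

```shell
# Illustrative only: the filter you'd apply to real `dmesg` output, e.g.
#   dmesg | grep -iE "xid|edac|mce" | tail -n 20
# Here we run it on two hypothetical kernel-log lines to show what matches.
sample='NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
EDAC sbridge: Seeking for: PCI ID 8086:6f6d'

# -i = case-insensitive, -c = count matching lines, -E = extended regex
printf '%s\n' "$sample" | grep -icE "xid|edac|mce"
```

Any Xid hits there would be worth including verbatim in a report to NVIDIA, since the Xid number identifies the failure class.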
Edit: When it breaks, I get a lot of `EDAC sbridge: Seeking for: PCI ID 8086:6f6d` messages.
I also noticed that training runs fine for a while, and then, once the errors start to appear, the time to finish the epoch suddenly increases a lot until the program eventually freezes.
Since a driver update seems to have caused the issue, you could try updating again to the latest driver (450+) and check whether that helps. If not, I would recommend creating a post in the NVIDIA developer forum.
I’ve tried this and it didn’t work. However, I also ran some tests with TensorFlow and it froze as well. To make sure it’s not a hardware issue, I booted into Windows (I dual-boot) and ran a benchmark with GFXBench. The test ran fine and the results are in line with those reported by other RTX 2080 Ti users. So I believe it’s an issue with Linux, the driver, or CUDA.
Yes, I also think it’s not PyTorch-related; it seems to be caused by the CUDA toolkit, the driver, the hardware, or some interaction between them, which is why I would suggest creating a post for the NVIDIA driver team.