Random freezes and CUDA errors

Hi guys, I have code that I’ve been using for a while, but I recently updated my Nvidia driver (I use Linux) to the 440.xx series and started experiencing random freezes. Training goes well, but it usually hangs forever somewhere in the first epoch; other times it runs 8 epochs and then hangs, sometimes it freezes the entire computer, and sometimes it throws random device-side assertion or CUDA errors (different errors on different runs)…
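For context, the workload that triggers this is an ordinary PyTorch training loop; a stripped-down sketch of that kind of loop (placeholder model and random data, not my actual code) would look roughly like this:

import torch
import torch.nn as nn

device = torch.device("cuda")

# Placeholder model and data; the real code is a normal training loop.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    for step in range(1000):
        x = torch.randn(256, 1024, device=device)
        y = torch.randint(0, 10, (256,), device=device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch} done, last loss {loss.item():.4f}")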

I’ve already tried going back to the older driver, updating PyTorch to 1.6, and using different combinations of PyTorch and CUDA (Torch 1.4 + CUDA 10.2 on conda, Torch 1.6 + CUDA 10.2 on conda, Torch 1.6 + CUDA 11 on Arch Linux’s Python), but the issue persists…

Any help is much appreciated. Thanks!

Could you check dmesg for any reported errors/warnings that could explain the system freezes?
Also, are you only seeing this behavior with PyTorch, or with any random CUDA program?
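You could also rerun with blocking kernel launches, so that the CUDA error is reported at the operation that actually failed instead of at a later sync point. A minimal sketch (wrap your own script the same way):

import os

# Set before the first CUDA call so kernel launches run synchronously and
# any device assert / CUDA error is reported at the op that caused it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(1024, 1024, device="cuda")
y = x @ x  # replace with your actual training step
torch.cuda.synchronize()
print(y.sum().item())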

Sure. It seems to be PyTorch-related, because other GPU workloads are fine.

My dmesg is full of:

[  959.195648] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195768] pcieport 0000:00:03.0: AER: can't find device of ID0018
[  959.195769] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195898] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  959.195902] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00001040/00002000
[  959.195903] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.195908] pcieport 0000:00:03.0: AER:    [12] Timeout               
[  959.195915] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195986] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.195987] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.195989] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.195992] pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
[  959.195996] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.195998] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.196000] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.196007] pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
[  959.196148] pcieport 0000:00:03.0: AER: can't find device of ID0018
[  959.196150] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.196291] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.196293] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.196294] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.196300] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0

Edit: When it breaks, I also get a lot of: EDAC sbridge: Seeking for: PCI ID 8086:6f6d

I also noticed that training runs well for a while; then, once the errors start to appear, the time to finish the epoch suddenly increases a lot, until the program eventually freezes.
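Given the Data Link Layer / BadTLP messages, a rough way to check whether corrupted transfers ever become visible from user code would be a copy-and-verify loop, something like this sketch (corrected AER errors are normally retried in hardware, so this may well come back clean):

import torch

device = torch.device("cuda")

# Repeatedly push a known pattern over PCIe and read it back.
# Any mismatch would indicate corruption reaching user space.
src = torch.arange(64 * 1024 * 1024, dtype=torch.int32)  # ~256 MB
for i in range(1000):
    on_gpu = src.to(device)
    back = on_gpu.cpu()
    if not torch.equal(src, back):
        print(f"mismatch detected on iteration {i}")
        break
else:
    print("no mismatches observed")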

Since a driver update seems to have caused the issues, you could try updating again to the latest one (450+) and check if that helps. If not, I would recommend creating a post in the NVIDIA developer forums.

I’ve tried updating to the latest driver and it didn’t help. However, I also ran some tests with TensorFlow and it froze as well. To make sure it’s not a hardware issue, I booted into Windows (I dual boot) and ran a benchmark with GFXBench. The test ran fine and the results are comparable to those reported by other RTX 2080 Ti users. So I believe it’s an issue with Linux, the driver, or CUDA.

Yes, I also think it’s not PyTorch-related; it seems to be caused by the CUDA toolkit, the driver, the hardware, or some interaction between them, which is why I would suggest creating a post for the NVIDIA driver team.
