I have a model that trains for >8 hours and completes without issue. However, immediately after training finishes, the GPU jumps to P0 instead of returning to P8 (where it was before), and nvidia-smi then shows ERR for fan speed and power usage (after the ERR readings it settles into P5). I'm forced to reboot the system every time this happens, and it now happens consistently, which is quite annoying. The training program itself doesn't do anything fancy beyond setting CUDA_VISIBLE_DEVICES (as in the thread "CUDA_VISIBLE_DEVICES make gpu disappear", though that doesn't appear to be the same problem). Running nvidia-bug-report.sh shows Xid 79, "GPU has fallen off the bus", which is often attributed to overheating. However, GPU temps are stable at 60-70C during training and around 30C at idle.
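For context, the only GPU-related setup in the script is roughly along these lines (the device index here is illustrative):

```python
import os

# Pin the process to one physical GPU *before* importing torch;
# after this, the selected card shows up as cuda:0 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # illustrative index

# import torch  # imported only after the env var is set
```

Nothing else in the script touches device state directly.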
Since I am currently working from home, I can't physically check on the machine, so I'm guessing at what the issue might be. Things I've considered: 1) GPU failure, 2) motherboard/PCIe bus failure (I've disabled ASPM), 3) PSU failure, 4) some PyTorch issue.
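To double-check remotely that ASPM really is off, I've been looking at the kernel's view of it (these commands assume sysfs on a stock Linux kernel; the bus address is the failing 2070's):

```shell
# Kernel-wide ASPM policy ("default", "performance", "powersave", ...)
cat /sys/module/pcie_aspm/parameters/policy

# Per-device link state; "ASPM Disabled" should appear for the failing card
sudo lspci -vv -s 05:00.0 | grep -i aspm
```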
Has anyone experienced something similar caused by PyTorch? Is there something software-side I'm missing that could cause this? Most likely it's a hardware issue, but since I don't have physical access to the machine, I'm hoping it's something I can fix remotely.
The exact setup is a dual-socket Haswell system, with one 2070 on the PCIe lanes of one CPU (CPU1 in NUMA terms), and a 2070 and a 1070 attached to the PCIe lanes of the other CPU (CPU0). The failure happens on the 2070 that shares a CPU with the 1070. On that socket the 1070 is at bus 04:00.0 and the 2070 at 05:00.0, so I did consider whether some PCIe error was causing everything but the first bus to lose connection (since the 1070 and the other 2070 are both on the first PCIe bus of their respective sockets). The PSU is 1300W, which should be more than sufficient for such a system.
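For what it's worth, the topology above can be confirmed remotely without touching the hardware:

```shell
# GPU <-> CPU/PCIe affinity matrix: shows which cards share a socket's lanes
nvidia-smi topo -m

# Raw PCI view of the two cards on CPU0's lanes
lspci -s 04:00.0
lspci -s 05:00.0
```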
What's odd is that running gpu-burn for an hour doesn't cause issues, and neither does training some example PyTorch models from online for a few hours.
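To catch the P0/ERR transition as it happens, I've been thinking of polling NVML from a separate process during and after training. A minimal sketch, assuming the `pynvml` bindings are installed alongside the driver (the device index and poll interval are illustrative):

```python
import time

def sample_gpu_state(index=0):
    """Return one reading of pstate/temp/power/fan, or None if NVML is unavailable
    or errors out. The index is illustrative (pick the failing 2070)."""
    try:
        import pynvml
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        state = {
            "pstate": pynvml.nvmlDeviceGetPerformanceState(handle),  # 0 = P0, 8 = P8
            "temp_c": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # NVML reports mW
            "fan_pct": pynvml.nvmlDeviceGetFanSpeed(handle),
        }
        pynvml.nvmlShutdown()
        return state
    except Exception:
        # An NVML error here (e.g. right when the GPU falls off the bus)
        # is itself a useful data point, so don't crash the logger.
        return None

if __name__ == "__main__":
    # Short demo loop; in practice this would poll every few seconds for hours.
    for _ in range(3):
        print(time.strftime("%H:%M:%S"), sample_gpu_state(0))
        time.sleep(1)
```

The idea is to get a timestamped trace of exactly when the card leaves P8 and when the fan/power readings turn into ERR, relative to the end of training.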