Full system crash when using PyTorch

I’m getting a full system crash when training large models with PyTorch on a 2080 Ti.

The crash comes faster with larger models: anything needing less than 4 GB of GPU memory can run for a few hours, while anything over 9 GB crashes within 10-20 minutes.

This would scream “hardware issue” or “overheating”, if not for the fact that everything runs fine in other frameworks.

There’s no crash when using cuda_memtest, for example: it allocates as much memory as it can and exercises it, leaving the device at 100% utilization in nvidia-smi. It finds no issues and doesn’t crash the system.

This issue has persisted across different versions of PyTorch (1.4, 1.5, and 1.6), different NVIDIA drivers (440 and 450), OS reinstalls (Linux Mint and Ubuntu 18.04), and different CUDA versions (10.1 and 10.2). It happens with every code base I’ve tried: MMDetection, AdelaiDet, WongKinYiu/PyTorch_YOLOv4, and ultralytics/yolov5.

Temperature readings on the GPU and CPU don’t show anything running particularly hot; it can crash with the GPU below 70 °C.
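Since the nvidia-smi readings are gone once the machine resets, a small logging loop along these lines (a sketch, assuming the pynvml bindings from the nvidia-ml-py package are installed) can keep temperature and power readings on disk right up to the crash:

import os
import time
import pynvml  # assumed: the nvidia-ml-py / pynvml package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
with open("gpu_log.csv", "a") as f:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        f.write(f"{time.time():.0f},{temp},{power_w:.1f}\n")
        f.flush()
        os.fsync(f.fileno())  # force to disk so the log survives the hard reset
        time.sleep(1)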

When the crash happens, the screen freezes for a few seconds before the system reboots.

I haven’t dug deeply into what the above-mentioned code bases have in common, but obviously they all use neural-net layers, and perhaps similar data-loading mechanisms.

Obviously I’m not the first to run PyTorch on a 2080 Ti. Yet something consistently causes problems when running large models over time, across various software configurations, but only with PyTorch. It’s as if the probability of a crash increases with model_size * time.

Any ideas on what could be going on here? Anything I could do to troubleshoot this?


This sounds more like a PSU issue. Could you check dmesg for Xid errors after the crash?

I guess it could be the PSU; I figured the stability with other libraries counted against that theory. It’s rated for 1000 W, which should be plenty for a single-GPU system. I’m not seeing Xid errors in the logs, or really any recurring message preceding the crashes (looking at /var/log/kern.log and journalctl output on Ubuntu 18.04).

I’ve discovered that setting num_workers=0 on the PyTorch DataLoader makes things considerably more stable (although crashes can still happen after several hours). All four code bases that have crashed rely on the PyTorch DataLoader. Of course, setting num_workers=0 also slows down execution, which means less stress on the GPU, lower power draw, etc., so I assume that’s why it helps. In any case, I don’t really see how a DataLoader bug could bring down the entire system.
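For reference, the only thing I change in each repo is the num_workers argument on the DataLoader. A minimal sketch with a placeholder dataset (the real repos build their own):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for whatever the repo actually loads.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.zeros(64, dtype=torch.long))
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,  # was 8; 0 loads batches in the main process, no worker subprocesses
    pin_memory=True,
)
for images, labels in loader:
    pass  # training step would go here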

It’s possible that the four PyTorch code bases just happen to be able to saturate the hardware better, thus causing the crash, for example through power draw. I suppose I should try very hard to get things to crash with other libraries, which would rule out PyTorch altogether as a part of the problem.

So far, the system crashes reported on this forum have been isolated to hardware defects (most of the time the PSU was at fault). You could try to limit the power usage of your GPU via nvidia-smi, if your device supports it, and rerun the script.

Data loading with multiple workers is CPU-bound, so one thing to try is running a CPU benchmark (e.g. a Blender render test) and seeing whether that crashes.

You could also compare GPU and CPU utilization between the two setups and try separating the loads into different processes, e.g. by feeding the same random inputs to the GPU (which eliminates data loading as a bottleneck) while something else takes all of your CPU.
Note that saturating only the GPU or only the CPU might not reproduce the problem if it is caused by power draw.
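A minimal sketch of the GPU-only part (the toy model and batch shape are just placeholders; swap in whatever you actually train), which you could pair with a separate CPU-burning process:

import torch
import torch.nn as nn

# Reuse one fixed random batch so no data loading or CPU preprocessing is involved.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 3, 224, 224, device="cuda")
while True:
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()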

Thanks everyone for your help, it’s much appreciated.

I’ve now tried running large matrix multiplications in a loop on both the CPU and the GPU. This leaves nvidia-smi showing ~250 W out of 250 W power draw on the GPU while htop shows ~100% utilization on all 24 CPU cores. I ran this for 1.5 hours with no crash.
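In rough outline the script looked like this (the matrix sizes and per-core setup are illustrative, not the exact values I used):

import multiprocessing as mp
import torch

def cpu_burn():
    torch.set_num_threads(1)  # one matmul thread per process, one process per core
    a = torch.randn(2048, 2048)
    b = torch.randn(2048, 2048)
    while True:
        _ = a @ b

if __name__ == "__main__":
    # Pin every CPU core at ~100% ...
    for _ in range(mp.cpu_count()):
        mp.Process(target=cpu_burn, daemon=True).start()
    # ... while the main process saturates the GPU with large matmuls.
    a = torch.randn(8192, 8192, device="cuda")
    b = torch.randn(8192, 8192, device="cuda")
    while True:
        a = a @ b
        a = a / a.norm()  # keep values from overflowing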

I also tried throttling the GPU to 200 W via nvidia-smi and then running the PyTorch neural-net code that has crashed before (num_workers=8). It crashed after a couple of hours, which is longer than it usually lasts, but I don’t know whether that’s due to the lower power draw, the slower execution, or just a fluke.

This leaves me somewhat more confident that it isn’t the PSU. Do you agree? Are there other tests I could do to exercise the PSU?

Besides running your “stress test” for a longer period of time, I don’t know if there are many more tests.
If possible, you could swap parts of the system (PSU, GPU, RAM, etc.) one by one with parts from another machine and check if it’s still failing. Switching the PCIe slot might also help. A defect in the PCIe connection would usually just drop the GPU, but hardware defects are not always “well behaved”.

Did you check your GPU temperature? I ran into the same problem recently and finally tracked it down.
In a terminal, run:

watch -n 0.5 nvidia-smi -q -i 0,1 -d TEMPERATURE

Then start your training. If the GPU Current Temp exceeds the GPU Shutdown Temp, the system will crash.
