I’m getting a full system crash when training large models with PyTorch on a 2080 Ti.
Larger models crash sooner: anything needing less than 4 GB of GPU memory can run for a few hours, while anything over 9 GB crashes within 10-20 minutes.
This would scream “hardware issue” or “overheating”, if not for the fact that everything runs fine in other frameworks.
There’s no crash when using:
cuda_memtest allocates as much memory as it can and exercises it, leaving the device at 100% utilization in nvidia-smi. It finds no issues and doesn’t crash the system.
The issue has persisted across different versions of PyTorch (1.4, 1.5 and 1.6), different NVIDIA drivers (versions 440 and 450), OS reinstalls (Linux Mint and Ubuntu 18.04), and CUDA versions (10.1 and 10.2). It happens on all code bases I’ve tried: MMdetection, AdelaiDet, WongKinYiu/PyTorch_YOLOv4, ultralytics/yolov5.
The temperature readings on the GPU and CPU never go particularly high. It can crash with the GPU temperature below 70 °C.
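To check whether a short temperature or power spike shows up right before a crash, I could log nvidia-smi readings to disk once a second and look at the last rows after the reboot. Here is a minimal sketch, assuming `nvidia-smi` is on the PATH and there is a single GPU. The file name `gpu_log.csv` and the helper names are placeholders I made up, not anything from the code bases above.

```python
import csv
import os
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_sample():
    """Query nvidia-smi once and return the reading for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one reading per interval, forcing each row to disk."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        while True:
            writer.writerow(read_sample())
            f.flush()
            os.fsync(f.fileno())  # so the last row survives a hard reboot
            time.sleep(interval_s)


# log_forever()  # uncomment and run in a second terminal while training
```

The `fsync` after every row matters here because the machine reboots without a clean shutdown, so buffered rows would otherwise be lost. A one-second interval may still miss very short transients, so a clean log would not rule out a power problem.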
When the crash happens, the screen will freeze for a few seconds, before the system reboots.
I haven’t dug deeply into what the above-mentioned code bases have in common, but the obvious candidates are the neural net layers and perhaps the data loading mechanisms.
Obviously I’m not the first to run PyTorch on a 2080 Ti. Yet something consistently causes problems when running large models over time, across various software configurations, but only with PyTorch. It’s as if the probability of a crash increases with model_size * time.
Any ideas on what could be going on here? Anything I could do to troubleshoot this?