I’m experiencing full system crashes after around 5-20 minutes of training: no mouse/keyboard, no TTY, no remote SSH, and a hard reset is required. It’s fairly repeatable on my setup and across different models.
Here are my system details: Ubuntu 18.04, Nvidia driver 410 (installed from ppa:graphics-drivers/ppa; also tried 415), CUDA 10.0, cuDNN 7.4.2. 2 GPUs (1080, 2080Ti). Thermals (from nvidia-smi) seem normal. PSU is 1500W. CPU is an AMD Threadripper 2950X and the motherboard is an ASUS Zenith Extreme with 128GB RAM.
Here’s what I’ve tried: Updated the Linux kernel to the latest version that will work with the drivers (4.19). Reseated all the GPUs & RAM. Tested each GPU individually. Fresh install of Ubuntu. Stock Python 3.6 & Anaconda 3.7. Ran 1 hour of 100% CPU usage just to make sure it’s not a CPU/RAM issue. Tried PyTorch 1.0 & the nightly build. gpu-burn runs fine (although I realise this is a very different load pattern to what PyTorch does).
I’ve posted a gist of my code here: https://gist.github.com/Anjum48/0ad193d4f408346c47533b835e86e10c
One thing I haven’t tested is to see if TensorFlow causes the same issues since there isn’t a stable build for Python 3.7 yet. This is a new system, but I know the 1080 was happily running TF code in Python 3.6 in an older machine.
I think the crash is occurring either in the DataLoader or in autograd, but I can’t tell which.
Does anyone have any ideas on how to diagnose this? It’s difficult because everything freezes, so I’m not getting any error messages, and there is nothing unusual in the Ubuntu log files to help with the diagnosis.
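One thing I could try to tell the DataLoader and autograd apart is a breadcrumb log that is forced to disk after every stage of the training loop, so that the last line surviving the hard reset shows roughly where the freeze happened. This is only a minimal sketch and not something from my gist: the Breadcrumbs class, the last_stage helper, the crumbs.log path and the stage names are all hypothetical. Since CUDA calls are asynchronous, I assume I would also need torch.cuda.synchronize() before each mark for the stages to be meaningful.

```python
import os
import time


class Breadcrumbs:
    """Append-only log that is flushed and fsynced after every write.

    Hypothetical helper: the idea is that whatever line is last on disk
    after a hard reset names the last stage that completed.
    """

    def __init__(self, path):
        self._f = open(path, "a")

    def mark(self, stage, step):
        self._f.write(f"{time.time():.3f} step={step} stage={stage}\n")
        self._f.flush()              # push Python's buffer to the OS
        os.fsync(self._f.fileno())   # ask the OS to write it to disk

    def close(self):
        self._f.close()


def last_stage(path):
    """Return the last complete line in the log, or None if there is none."""
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.endswith("\n")]
    return lines[-1] if lines else None


# Intended use inside the training loop (names below are placeholders):
#
# crumbs = Breadcrumbs("crumbs.log")
# for step, batch in enumerate(loader):
#     crumbs.mark("batch_loaded", step)
#     loss = compute_loss(model, batch)
#     torch.cuda.synchronize()   # wait for the GPU before marking
#     crumbs.mark("forward_done", step)
#     loss.backward()
#     torch.cuda.synchronize()
#     crumbs.mark("backward_done", step)
```

After a reboot, last_stage("crumbs.log") would return the final breadcrumb. If it ends in batch_loaded, the freeze happened somewhere in the forward pass; if it ends in backward_done, it happened while fetching the next batch. The fsync and synchronize calls will slow training down, so this would be for diagnosis only.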