Full system crash when using PyTorch 1.0

I’m experiencing full system crashes after around 5–20 minutes of training (no mouse/keyboard, no tty, no remote SSH; a hard reset is required). It’s fairly repeatable on my setup and across different models.

Here are my system details: Ubuntu 18.04, NVIDIA driver 410 (installed from ppa:graphics-drivers/ppa; also tried 415), CUDA 10.0, cuDNN 7.4.2, two GPUs (1080, 2080 Ti). Thermals (from nvidia-smi) look normal. PSU is 1500 W, CPU is an AMD Threadripper 2950X, and the motherboard is an ASUS Zenith Extreme with 128 GB RAM.

Here’s what I’ve tried: updating the Linux kernel to the latest version that works with the drivers (4.19), reseating all the GPUs and RAM, testing each GPU individually, a fresh install of Ubuntu, and both stock Python 3.6 and Anaconda 3.7. I ran 1 hour at 100% CPU usage just to make sure it’s not a CPU/RAM issue, and tried both PyTorch 1.0 and the nightly build. gpu-burn runs fine (although I realise this is a very different load pattern to what PyTorch does).

I’ve posted a gist of my code here: https://gist.github.com/Anjum48/0ad193d4f408346c47533b835e86e10c

One thing I haven’t tested is whether TensorFlow causes the same issues, since there isn’t a stable build for Python 3.7 yet. This is a new system, but I know the 1080 was happily running TF code under Python 3.6 in an older machine.

I think the crash is occurring either in the DataLoader or in autograd, but I can’t tell which.

Does anyone have any ideas on how to diagnose this? It’s difficult since everything freezes: I’m not getting any error messages or anything unusual in the Ubuntu log files to help the diagnosis.


Have you checked that you don’t have a CPU RAM or GPU RAM leak in your code? Those could cause the swap to fill up. Or that you aren’t filling your disk with temp files?
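One way to check for a host-side leak without extra tools is to watch the process’s peak resident set size over time — if it keeps climbing across training iterations, something is leaking. A minimal stdlib-only sketch (the 50 MB `bytearray` allocation below is just a stand-in for a leaky training loop; note `ru_maxrss` is reported in KB on Linux but bytes on macOS):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
# Stand-in for a leaky loop: hold ~50 MB of buffers alive.
junk = [bytearray(1024 * 1024) for _ in range(50)]
after = peak_rss_mb()
print(f"peak RSS grew by ~{after - before:.0f} MB")
```

Logging this (and `torch.cuda.memory_allocated()` for the GPU side) once per epoch makes a slow leak obvious long before the swap fills.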

From vtop, only 10% of the RAM is used and there’s plenty of space on the SSD. Swap is at 0%.

Not sure what happens then :confused:

You can try setting the DataLoader’s number of workers to 0 to check whether that’s the cause.
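With `num_workers=0` the DataLoader loads batches in the main process, so worker-process or shared-memory problems are taken out of the picture. A minimal sketch with a toy dataset (the tensors here are just placeholders, not the original gist’s data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 8 samples, 1 feature each, with integer labels.
features = torch.arange(8, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(8)
dataset = TensorDataset(features, labels)

# num_workers=0 -> batches are loaded in the main process,
# ruling out worker subprocess / shared-memory issues.
loader = DataLoader(dataset, batch_size=4, num_workers=0)
batches = list(loader)
print(len(batches))  # 8 samples / batch_size 4
```

If the freeze goes away with `num_workers=0`, the problem is in the worker processes (or how the dataset behaves when copied into them) rather than in autograd.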

I think I found the issue. The other day, after a reboot, I noticed the system was much more stable than ever before. In nvidia-smi the order of the GPUs had flipped, so that the RTX was GPU 0 and the 1080 was GPU 1 (the 1080 is in the first PCIe slot and the RTX in slot 3).

I had read before that the GPUs are usually ordered with GPU 0 as the fastest card, not by PCIe location, so it looks like this hadn’t happened properly after the driver install/reinstalls. I don’t know why it randomly decided to switch them, or why after so many reboots it chose now as the right time to do it, but I’m glad it did, since everything is fine now. Thanks for your help though!
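For anyone hitting similar ordering surprises: the CUDA runtime documents a `CUDA_DEVICE_ORDER` environment variable (default `FASTEST_FIRST`) that can be set to `PCI_BUS_ID` to pin enumeration to slot order, so device indices stop depending on which card the driver ranks fastest. A sketch — it must be set before CUDA is initialised, i.e. before the first `import torch`:

```python
import os

# Must run before CUDA initialisation (before the first `import torch`
# or any CUDA call), otherwise the setting is ignored.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # enumerate GPUs by PCIe slot
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # e.g. expose only the card in the first slot

print(os.environ["CUDA_DEVICE_ORDER"])
```

The same two variables can equally be exported in the shell before launching the training script.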
