Random reboot during training

fdarmon · June 4, 2018, 8:24am

Hello,

I am trying to train a model on a Ubuntu 14.04 with a Nvidia 1080 ti, Cuda 8.0 and Pytorch 0.4. I have never been able to train an 200 epochs. The computer reboots after a random time from 1 to 90 epochs.

I have investigated the problem:

It is not a PSU max power problem because the computer can run other programs (with pytorch 0.3) with its 4 GPUs
Overheating: I have ran a script that monitor nvidia-smi every 0.05s and when the crash occurs, the temperature of the 4 GPUs is 78°, 51°, 43° and 34° and their power consumption is 90W, 13W, 11W and 9W (when training on GPU 0)
I have also monitored the CPU and GPU memory utilization and everything is stable
I have experienced the crash when training on GPU 0 and GPU 2 so it is not a specific hardware issue.
Similar to Automatically reboot when set cudnn.benchmark to True I set cudnn.benchmark = False and still got the crash.
I was able to complete the training multiple times on another computer with the same hardware (except a smaller PSU) and Ubuntu 16.04 and Cuda 9.1.

One possible explanation is the one of Reliably repeating pytorch system crash/reboot when using imagenet examples · Issue #3022 · pytorch/pytorch · GitHub large power variation causing crashes. I ran my program with a finer image resolution in order to have a constant 100% utilization of the GPU and I got a huge increase in power consumption (still with large power variation) but I experienced no crash. Since the crash appears after a random time up to 6 hours, maybe I have just not waited long enough.

Do you think this is the explanation ? The more Powerful PSU on this computer is in fact more sensible to power variations and shuts off. Do you think it is a software problem with Cuda 8 and pytorch 0.4 ?

Thank you for your help

EDIT: Add a plot the temperature for a training that crashed

danieldlongo · October 22, 2018, 3:48am

Did you ever solve this problem? I have a similar error where my computer restarts shortly after I start training. I have a 750w psu and only 1 gpu (1080ti) so I don’t think it is a power problem. Also, I did not see an increased wattage going to my gpu before it restarts.

fdarmon · October 22, 2018, 7:22am

No, I have stopped using that computer and used another one with Ubuntu 16.04 and Cuda 9.1. I never experienced a similar crash with this computer.

I think it was a specific hardware / driver / cuda compatibility issue.

Alejandro_Castaneira · July 27, 2019, 11:14am

I have a similar problem with a RTX 2080ti using cuda 10.1 and a 1100W power supply, it randomly crash after a couple seconds of training on the GPU.

AravindDoss · August 13, 2019, 4:16pm

Hi,
I have same issues. check this link

Did you find a fix?

Thanks

Aldebaran · July 31, 2020, 8:39pm

Hi,

I have same issue with the same GPU and Cuda.
Did you find a fix?

Saurabh_Bagalkar · September 16, 2020, 9:49pm

Exact same thing! Any solution?

omeryildirim · July 4, 2022, 1:23pm

I was facing similar issues. Even with small batch sizes in Pytorch the computer would reboot by itself. I removed a gpu but still no solution. I did the following and all my problem was solved.

sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 1400
sudo nvidia-smi -lmc 6500
sudo nvidia-smi -gtt 65
sudo nvidia-smi -cc 1
sudo nvidia-smi -pl 165

These settings are for the RTX 2080TI. Edit according to your own gpus.

My system:

HP Z800 Workstation
Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
PSU 850W
Ubuntu 20.04
2x RTX 2080TI
NVIDIA-SMI 510.73.05 with CUDA Version: 11.6

ptrblck · July 4, 2022, 8:37pm

Since limiting the power limit and clocks seems to help, I would also suggest to check the PSU as it seems to be too weak or faulty.

omeryildirim · July 5, 2022, 8:01am

Thanks, I gonna check it.