My GPU died while using NVIDIA Apex

I was training my model with three NVIDIA RTX 2080 Ti GPUs on Ubuntu 16.04.
I used NVIDIA Apex to make full use of the GPUs.
However, my PyTorch training code hung after a few epochs.
It had worked well for one or two training runs.
I terminated the program and checked the GPUs with 'nvidia-smi'.
It showed only two GPUs (and the command was really slow).

I found out that one of my GPUs was dead.
My computer did not boot properly with that dead GPU installed (the GUI didn't show up).
I reinstalled the OS as Ubuntu 18.04 and reinstalled the drivers, but the problem persisted.
When I plugged that GPU into a Windows machine, it showed error code 43.

I was wondering whether this problem was caused by Apex or whether my graphics card was simply faulty.
Has anyone had a similar issue with Apex?

I've never seen this issue caused by using apex and suspect the GPU might have a hardware problem.
Depending on the opt_level you are using in apex.amp, we are e.g. patching some PyTorch methods to use FP16 instead of FP32 (whitelist/blacklist style) or transforming the model's parameters to FP16 while keeping FP32 master parameters (and master gradients), etc.
apex does not manipulate the hardware in any way and just uses CUDA and PyTorch for e.g. mixed-precision training.
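
For reference, here is a minimal sketch of how apex.amp is typically set up; the model, optimizer, and dummy data are placeholders standing in for your own training code:

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model, must be on GPU before amp.initialize
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level selects the mixed-precision strategy:
#   "O1" patches whitelisted ops to run in FP16 (whitelist/blacklist style),
#   "O2" casts the model to FP16 and keeps FP32 master parameters.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

data = torch.randn(32, 1024, device="cuda")    # dummy batch
target = torch.randn(32, 1024, device="cuda")

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(data), target)
# Scale the loss so small FP16 gradients don't underflow to zero.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```

As you can see, none of this touches the hardware directly; it's all ordinary CUDA work through PyTorch.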

How old is the GPU and how long was it working fine?

I just bought the GPUs.
I've been using them for about two weeks, and they ran fine for a few training sessions.

I'm not sure it is relevant, but there were some weird incidents while using Apex.
When I terminated a Python script with Ctrl-C, sometimes Apex did not fully terminate and some sub-processes remained. Those sub-processes kept holding GPU memory, so I had to kill them manually. These incidents made me suspect Apex.
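
Something like the following sketch can list (and optionally kill) those leftover processes; it is just a convenience script I'm showing for context, using nvidia-smi's query interface, and the kill line is commented out for safety:

```python
import os
import signal
import subprocess

# --query-compute-apps lists the PIDs of compute processes currently on the GPUs.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory",
     "--format=csv,noheader"],
    text=True,
)
for line in out.strip().splitlines():
    pid, used_memory = [field.strip() for field in line.split(",")]
    print(f"PID {pid} holds {used_memory}")
    # Uncomment to actually kill the orphaned process:
    # os.kill(int(pid), signal.SIGKILL)
```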

I agree that it might be a hardware problem, but I wonder whether high GPU utilization might harm a GPU.
Is it possible that running a GPU at 99% utilization for a long time can damage the hardware (e.g. through overheating)?

Moreover, is there any other way to push GPU utilization up to 90% besides using Apex?
When I tested some code, GPU utilization was around 50–70%, and I'm not sure whether that is normal.
I wanted to increase it, so I ended up with Apex.

I'm not sure if this is caused by apex or PyTorch, as I've seen this behavior using plain PyTorch. If I'm not mistaken, this should be fixed in the latest stable release.

If you didn't overclock the GPU, it should be fine. In case your device overheats, e.g. if your GPUs are packed tightly into the case, it should reduce its clock and shut down as a last resort.

It depends on your code; e.g. you might have a data loading bottleneck.
This post explains some workarounds.
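
The usual first steps are more DataLoader workers, pinned memory, and asynchronous host-to-device copies. A rough sketch, where the TensorDataset is a placeholder for your real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own Dataset here.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,    # load batches in background worker processes
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for data, target in loader:
    # non_blocking=True overlaps the copy with compute when pin_memory is set.
    data = data.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    # ... forward/backward pass ...
```

If utilization is still low after that, profiling the input pipeline will tell you whether the GPU is actually waiting on data.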

Thank you very much! Your answers really helped me.