I was training my model on three NVIDIA RTX 2080 Ti GPUs under Ubuntu 16.04.
I used NVIDIA Apex to get the most out of the GPUs.
However, my PyTorch training code hung after a few epochs.
Before that, it had worked fine for one or two training runs.
I terminated the program and checked the GPUs with ‘nvidia-smi’.
It showed only two GPUs (and the command was very slow to respond).
I found out that one of my GPUs was dead.
With the dead GPU installed, my computer would not boot properly (the GUI did not show up).
I reinstalled the OS as Ubuntu 18.04 and reinstalled the drivers, but the problem persisted.
When I plugged that GPU into a Windows machine, it reported error code 43.
I am wondering whether this problem was caused by Apex or whether my graphics card itself was faulty.
Has anyone had a similar issue with Apex?