CUDA failed suddenly

Good morning!

I have been having a problem with PyTorch lately. I reformatted the computer and installed everything I needed for work, including torch. Then I ran many experiments, each of which uses torch. They run properly until, after some hours, torch suddenly reports “CUDA driver initialization failed”.

The command nvidia-smi still works, but something changes: the line under the name of the graphics card no longer shows the power used (instead of 15W / 80W it says N/A / N/A). When I try to restart, the computer freezes, so I have to force it. After a forced restart, everything is OK again.

I have seen this in different conda environments, so, in principle, the problem is not a particular installation.

Do you know what to do in this case?

System: Ubuntu 22
Graphics Card: NVIDIA 4070

Thank you very much in advance!
Sam.

The issue seems to be unrelated to PyTorch and points to your setup or driver, so I would recommend asking on an NVIDIA discussion board.
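
Before posting there, it may help to record when the power reading described above turns into N/A, so the report can include a timestamp. Here is a minimal sketch using only the Python standard library and the nvidia-smi query options `--query-gpu` and `--format`. The helper names (`parse_power_line`, `check_gpu`) are hypothetical, and the exact text nvidia-smi prints for a missing value (assumed here to contain "N/A" or "Not Supported") may differ between driver versions.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: name, current power draw, power limit.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,power.draw,power.limit",
    "--format=csv,noheader",
]


def parse_power_line(line):
    """Split one CSV line into name, power draw, and power limit.

    Assumption: a missing reading contains "N/A" or "Not Supported".
    """
    # rsplit keeps any comma inside the GPU name in the first field.
    name, draw, limit = [field.strip() for field in line.rsplit(",", 2)]

    def is_missing(value):
        return "N/A" in value or "Not Supported" in value

    return {
        "name": name,
        "power_draw": draw,
        "power_limit": limit,
        "power_missing": is_missing(draw) or is_missing(limit),
    }


def check_gpu():
    """Return one parsed entry per GPU, or an empty list if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return []
    return [parse_power_line(l) for l in result.stdout.splitlines() if l.strip()]


if __name__ == "__main__":
    for gpu in check_gpu():
        status = "power reading missing" if gpu["power_missing"] else "ok"
        print(f'{gpu["name"]}: {gpu["power_draw"]} / {gpu["power_limit"]} ({status})')
```

Running this periodically (for example from cron) while the experiments are active would show roughly when the reading disappears, which is useful context for the NVIDIA forum.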

Thank you for answering!

It is strange, since CUDA stops working only while torch is running. However, I’ll ask on the NVIDIA forums about it!

Thanks again!