torch.version.cuda: 12.6

In PyCharm:

torch.cuda.is_available: True
torch.version.cuda: 12.6
torch.cuda.device_count: 1
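
For reference, those values come from a quick sanity check along these lines (just a sketch; the exact script isn't in the original post):

# Quick sanity check of the CUDA setup, run from PyCharm before training.
import torch

print("torch.__version__:", torch.__version__)
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device name:", torch.cuda.get_device_name(0))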

In the terminal I run nvidia-smi, and everything looks fine there too: driver 560, CUDA 12.6.
So everything seems fine. I start the training loop, and after 8-10 epochs (about 15 minutes) everything collapses: every check now reports that CUDA does not exist at all:

  return torch._C._cuda_getDeviceCount() > 0
torch.cuda.is_available: False
torch.version.cuda: 12.6
torch.cuda.device_count: 0

Process finished with exit code 0

In the terminal:

PS C:\Users\User> nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

Installed with:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

After restarting the computer, CUDA is visible everywhere again, but again it only survives a few epochs and then it's gone. I have already reinstalled the drivers several times, as well as the CUDA Toolkit 12.6 and the PyTorch library in different ways; the result is always the same: the GPU dies after a few epochs and comes back to life for a while after rebooting the PC. I'm completely desperate and asking for help.

This sounds like a system issue and is unrelated to PyTorch.
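
One way to support that diagnosis is to timestamp exactly when the device drops out during training, so it can be matched against the Windows event log. Below is a minimal, self-contained sketch; the tiny model, the random data, and the epoch/step counts are placeholders rather than code from the original post:

# Per-epoch CUDA health check: logs the moment the device stops responding.
import datetime
import torch

def cuda_alive() -> bool:
    # Return False as soon as the GPU stops responding.
    try:
        torch.cuda.synchronize()
        (torch.ones(1, device="cuda") + 1).item()  # tiny op forces a real device round-trip
        return True
    except RuntimeError:
        return False

device = torch.device("cuda")
model = torch.nn.Linear(64, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(100):
    try:
        for _ in range(1000):
            x = torch.randn(256, 64, device=device)
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    except RuntimeError as exc:
        print(f"{datetime.datetime.now()}: CUDA error during epoch {epoch}: {exc}")
        break

    if cuda_alive():
        print(f"{datetime.datetime.now()}: epoch {epoch} finished, GPU still responding")
    else:
        print(f"{datetime.datetime.now()}: CUDA device lost after epoch {epoch}")
        break

If the failure time lines up with a driver or hardware event in the Windows Event Viewer (e.g. nvlddmkm errors), that points further towards hardware or the driver rather than PyTorch.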

I'm also already leaning towards a system-level problem; I was hoping someone might have had a similar experience. I'm going to work through the following options now:

  • In a couple of months a new, powerful Tesla graphics card will arrive; I'll plug it into the same PC and test with it.
  • I'll run a Linux VM on the same PC and try training there.
  • I'm waiting for suggestions in this thread.

I'm not deeply familiar with Windows, but on Linux I would recommend checking the dmesg logs: the driver usually reports such failures via Xid errors, which helps narrow down the cause, e.g. a PSU issue.
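
On Windows there is no dmesg, but something similar can be approximated by polling nvidia-smi in the background and logging temperature and power draw; the last readings written before the "GPU is lost" error can hint at a thermal or power-delivery problem. A rough sketch (the log file name and the 5-second interval are arbitrary choices, not from this thread):

# Background GPU health logger: run alongside training and inspect the tail
# of gpu_health.log after the GPU drops out.
import subprocess
import time

QUERY = "timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used"

with open("gpu_health.log", "a", encoding="utf-8") as log:
    while True:
        try:
            result = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            log.write(result.stdout if result.returncode == 0
                      else f"nvidia-smi failed: {result.stderr}")
        except Exception as exc:  # e.g. the driver stops responding entirely
            log.write(f"poll error: {exc}\n")
        log.flush()
        time.sleep(5)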