In the terminal I run nvidia-smi and everything looks fine: driver 560, CUDA 12.6.
Everything seems to be in order, I start the training loop, and after 8-10 epochs (about 15 minutes) everything collapses; every check reports that CUDA no longer exists at all:
return torch._C._cuda_getDeviceCount() > 0
torch.cuda.is_available: False
torch.version.cuda: 12.6
torch.cuda.device_count: 0
Process finished with exit code 0
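These values come from the standard PyTorch checks; a minimal per-epoch version (the helper name is just an example) shows exactly at which epoch the device disappears:

import torch

def log_cuda_state(epoch):
    # Once the GPU is "lost", is_available() flips to False and device_count() drops to 0.
    print(f"epoch {epoch}: torch.version.cuda={torch.version.cuda}, "
          f"torch.cuda.is_available={torch.cuda.is_available()}, "
          f"torch.cuda.device_count={torch.cuda.device_count()}")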
In the terminal:
PS C:\Users\User> nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: GPU is lost. Reboot
the system to recover this GPU
After restarting the computer, CUDA shows up everywhere again, but once more it only works for a few cycles and then it is gone. I have already reinstalled the drivers several times, as well as the CUDA Toolkit v12.6 and the PyTorch library, in different ways; the result is always the same: the GPU dies after a few epochs and comes back to life for a while after restarting the PC. I am completely desperate and asking for help.
I’m not deeply familiar with Windows, but on Linux I would recommend checking the dmesg logs, since the driver usually reports failures there via Xid messages, which helps narrow down the issue (e.g. a PSU problem).
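On Linux, a rough sketch of that check could look like this (dmesg usually needs root, and this assumes the NVIDIA driver writes Xid lines to the kernel log):

import subprocess

# Collect kernel log lines mentioning Xid; the Xid code points at the failure class
# (thermal, power, driver fault, etc.).
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in log.splitlines() if "Xid" in line]
print("\n".join(xid_lines) if xid_lines else "No Xid messages found")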
I am having some kind of trouble with the GPU. I installed a new Tesla T4; it is small and has no connectors for the PSU cables, you just seat it in the PCIe slot (at first I even wanted to install a newer, more powerful power supply). There may be a hardware problem with the motherboard, although everything else runs fine. Again, after roughly 10 epochs:
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
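As a side note, setting that variable before the first CUDA call makes kernel launches synchronous, so the stack trace points at the real failing operation. A minimal sketch:

import os

# Must be set before CUDA is initialized (simplest: before importing torch),
# or exported in the shell before launching the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch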
Addition:
CUDA throws the error when training convolutional neural networks (ResNet-152, VGG-19) on a dataset of roughly 224x224 images (5 classes of 500 images each).
I also tried a recurrent neural network (nn.RNN) for next-character prediction: 100 training epochs flew by in a minute on the Tesla T4 with no errors.
Maybe there is something wrong in the code after all? Or is it simply because the image workload involves far more computation? I would like to finally close this issue.
The issue has been resolved for both the RTX 2070 and the Tesla T4.
Software error.
I fixed the code and the way the data loader is built, since some of the images were corrupted; that alone could have been part of the problem.
from PIL import Image
from torchvision.datasets import ImageFolder


class SafeImageFolder(ImageFolder):
    """ImageFolder that skips corrupted images instead of crashing."""

    def __getitem__(self, index):
        path, target = self.samples[index]
        try:
            sample = Image.open(path).convert("RGB")
            if self.transform is not None:
                sample = self.transform(sample)
            return sample, target
        except Exception as e:
            print(f"[ERROR] Skipped file: {path}, error: {e}")
            return self[(index + 1) % len(self)]  # try the next sample instead


# creating the datasets
# train_data1 = ImageFolder(root=r"C:\Users\User\PycharmProjects\ML_stepik_Dubinin\siz\siz_data",
#                           transform=transform_v2)
train_data1 = SafeImageFolder(root=r"C:\Users\User\PycharmProjects\ML_stepik_Dubinin\siz\siz_data",
                              transform=transform_v2)
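In case it helps anyone, here is a sketch of pre-scanning the folder for broken images so they can be removed up front instead of being skipped during training (the path is the one from my code above; the function name is just an example):

from pathlib import Path
from PIL import Image

def find_broken_images(root):
    # Try to fully decode every image; collect the paths that raise so they can be deleted or replaced.
    broken = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp"}:
            try:
                with Image.open(path) as img:
                    img.convert("RGB")  # forces decoding of the whole file
            except Exception as e:
                broken.append((path, e))
    return broken

for path, err in find_broken_images(r"C:\Users\User\PycharmProjects\ML_stepik_Dubinin\siz\siz_data"):
    print(f"[BROKEN] {path}: {err}")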
Hardware error.
I installed a couple of third-party programs for monitoring the video card (MSI Afterburner, TechPowerUp GPU-Z). Both video cards were overheating badly, so the cooling was improved. The Tesla in particular was practically boiling, showing around +95 °C (it still seems to be alive).
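For anyone hitting the same thing, here is a small sketch of logging the GPU temperature from the training script itself via nvidia-smi (this assumes nvidia-smi is on PATH; the 90 °C threshold is just an example):

import subprocess

def gpu_temperature():
    # Ask the driver for the current GPU core temperature in degrees Celsius.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    return int(out.stdout.strip().splitlines()[0])

# e.g. once per epoch:
temp = gpu_temperature()
print(f"GPU temperature: {temp} C")
if temp > 90:  # example threshold; the T4 was hitting ~95 C before the cooling was fixed
    print("WARNING: GPU is running dangerously hot")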