torch.version.cuda: 12.6

In PyCharm:

torch.cuda.is_available: True
torch.version.cuda: 12.6
torch.cuda.device_count: 1
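
(For reference, output like this presumably comes from a quick check along these lines:)

import torch

print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.device_count:", torch.cuda.device_count())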

In the terminal, nvidia-smi also looks fine: driver 560, CUDA 12.6.
Everything seems fine, but when I start the training loop, after 8-10 epochs (about 15 minutes) everything collapses, and every check reports that CUDA no longer exists at all:

  return torch._C._cuda_getDeviceCount() > 0
torch.cuda.is_available: False
torch.version.cuda: 12.6
torch.cuda.device_count: 0

Process finished with exit code 0

In the terminal:

PS C:\Users\User> nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: GPU is lost.  Reboot
the system to recover this GPU

Installing:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

After restarting the computer, CUDA shows up everywhere again, but again it only works for a few cycles and then stops. I have already reinstalled the drivers, the CUDA Toolkit 12.6, and the PyTorch library several times in different ways; the result is always the same: CUDA dies after a few cycles and comes back to life for a while after a PC restart. I’m completely desperate and asking for help.

This sounds like a system issue and is unrelated to PyTorch.

I’m already leaning towards a system-level problem. I was hoping someone might have had a similar experience. I will be working through the following options now:

  • in a couple of months a new, more powerful Tesla graphics card will arrive. I’ll plug it into the same PC and test it.
  • I’ll run a Linux VM on the same PC and try training there.
  • I’m waiting for suggestions in this thread.

I’m not deeply familiar with Windows, but on Linux I would recommend checking the dmesg logs, since the driver usually reports failures there via Xid errors, which helps narrow down the issue (e.g. a PSU problem, etc.).
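
As a rough illustration of that check, a minimal sketch (assuming a Linux host where dmesg is readable by the current user; the filtering is just a plain substring match):

import subprocess

# scan the kernel log for NVIDIA Xid error reports (Linux only)
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in log.splitlines() if "xid" in line.lower()]
print("\n".join(xid_lines) if xid_lines else "No Xid entries found in dmesg")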

I’m having some kind of trouble with the GPU. I installed a new Tesla T4; it is physically small and doesn’t even take power connectors from the PSU, you just seat it in the PCIe slot (at first I wanted to install a more modern, more powerful power supply). There may be a hardware problem with the motherboard, although it works fine for everything else.
Again after roughly 10 epochs :slight_smile:
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
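
The CUDA_LAUNCH_BLOCKING=1 hint from the error message only takes effect if the variable is set before torch initializes CUDA, e.g. at the very top of the script or in the PyCharm run configuration; a minimal sketch:

import os

# force synchronous kernel launches so the failing call appears in the stack trace;
# must be set before torch initializes CUDA
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set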

Can anyone else advise?

Addition:
CUDA throws the error when training convolutional networks (ResNet-152, VGG-19) on a dataset of roughly 224*224 images (5 classes of 500 images each).
I also tried a recurrent network (nn.RNN) for next-character prediction: 100 training epochs flew by in a minute on the Tesla T4 with no errors.
Maybe there is something in the code after all? Or is it simply that the image models involve much heavier computation? I would like to finally close this issue.

Here is the code of the model on which CUDA drops out after approximately 10 epochs:

import torch
import torch.nn as nn
from torchvision import models
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights_resnet152 = models.ResNet152_Weights.DEFAULT
model = models.resnet152(weights=weights_resnet152)

# transforms_res_152 = weights_resnet152.transforms()

# model.requires_grad_(False)
# freeze the backbone; only the new classification head is trained
for param in model.parameters():
    param.requires_grad = False

# replace the final fully connected layer with a 3-class head
model.fc = nn.Linear(2048, 3)

model = model.to(device)

loss_model = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.fc.parameters(), lr=0.001, weight_decay=0.001)  
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt,
                                                          mode='min',
                                                          factor=0.01,
                                                          patience=50
                                                          )  # increase patience?

# _______________________________________________________________________________________________________
EPOCHS = 150
train_loss = []
train_acc = []
val_loss = []
val_acc = []
lr_list = []
best_loss = None
threshold = 0.05  

for epoch in range(EPOCHS):
    # training phase
    model.train()
    true_answer = 0
    running_train_loss = []
    train_loop = tqdm(train_loader, leave=False)
    for x, targets in train_loop:
        x = x.to(device)
        # convert integer class labels into one-hot float targets: (batch_size,) -> (batch_size, 3)
        targets = targets.reshape(-1).to(torch.int32)
        targets = torch.eye(3)[targets].to(device)

        pred = model(x)
        loss = loss_model(pred, targets)

        opt.zero_grad()
        loss.backward()
        opt.step()

        running_train_loss.append(loss.item())

        mean_train_loss = sum(running_train_loss) / len(running_train_loss)

        train_loop.set_description(f"Epoch [{epoch + 1}/{EPOCHS}], train_loss={mean_train_loss:.4f}")

        true_answer += (pred.argmax(dim=1) == targets.argmax(dim=1)).sum().item()

    
    running_train_acc = true_answer / len(train_data)
    
    train_loss.append(mean_train_loss)
    train_acc.append(running_train_acc)

    
    # validation phase
    model.eval()
    with torch.no_grad():
        true_answer = 0
        running_val_loss = []
        for x, targets in val_loader:
            x = x.to(device)
            # convert integer class labels into one-hot float targets: (batch_size,) -> (batch_size, 3)
            targets = targets.reshape(-1).to(torch.int32)
            targets = torch.eye(3)[targets].to(device)

            
            pred = model(x)
            loss = loss_model(pred, targets)

            running_val_loss.append(loss.item())
            mean_val_loss = sum(running_val_loss) / len(running_val_loss)
            # train_loop.set_description(f"Epoch [{epoch + 1}/{EPOCHS},train_loss={mean_train_loss:.4f}")
            true_answer += (pred.argmax(dim=1) == targets.argmax(dim=1)).sum().item()

        
        running_val_acc = true_answer / len(val_data)

        
        val_loss.append(mean_val_loss)
        val_acc.append(running_val_acc)

    
    lr_scheduler.step(mean_val_loss)
    lr = lr_scheduler.get_last_lr()[0]

    print(f"Epoch [{epoch + 1}/{EPOCHS},train_loss={mean_train_loss:.4f},"
          f"train_acc={running_train_acc:.4f},val_loss={mean_val_loss:.4f},"
          f"train_acc={running_val_acc:.4f}, lr={lr}")

    
    # checkpoint when validation loss improves by more than `threshold` relative to the best
    if best_loss is None:
        best_loss = mean_val_loss
    if best_loss - mean_val_loss > best_loss * threshold:
        best_loss = mean_val_loss

        # torch.save(model.state_dict(), f"model_state_dict_epoch_{epoch + 1}.pt")
        # torch.save(model, f"modelVGG19_ITOG_epoch_{epoch + 1}_val_{mean_val_loss:.4f}_train_{mean_train_loss:.4f}.pt")
        torch.save(model.state_dict(),
                   f"model_state_ResNet152_ROST_epoch_{epoch + 1}_val_{mean_val_loss:.4f}_train_{mean_train_loss:.4f}.pt")
        

    
    if epoch == (EPOCHS - 1):
        # torch.save(model, f"modelVGG19_ITOG_epoch_{epoch + 1}_val_{mean_val_loss:.4f}_train_{mean_train_loss:.4f}.pt")
        torch.save(model.state_dict(),
                   f"model_state_ResNet152_ROST_epoch_{epoch + 1}_val_{mean_val_loss:.4f}_train_{mean_train_loss:.4f}.pt")

The issue has been resolved for both RTX 2070 and Tesla T4.

  1. Software error.
    I fixed the code and the way the data loader is built, since some of the images were corrupted; that alone could have been part of the problem.
from PIL import Image
from torchvision.datasets import ImageFolder


class SafeImageFolder(ImageFolder):
    def __getitem__(self, index):
        path, target = self.samples[index]
        try:
            sample = Image.open(path).convert("RGB")
            if self.transform is not None:
                sample = self.transform(sample)
            return sample, target
        except Exception as e:
            print(f"[ERROR] Skipped file: {path}, error: {e}")
            return self[(index + 1) % len(self)]  # fall back to the next sample

# create the datasets
# train_data1 = ImageFolder(root=r"C:\Users\User\PycharmProjects\ML_stepik_Dubinin\siz\siz_data",
#                           transform=transform_v2)
train_data1 = SafeImageFolder(root=r"C:\Users\User\PycharmProjects\ML_stepik_Dubinin\siz\siz_data",
                              transform=transform_v2)
  2. Hardware error.
    I installed several third-party tools for monitoring the video card (MSI Afterburner, TechPowerUp GPU-Z). Both video cards were overheating badly, so the cooling system was improved. The Tesla was practically boiling, showing up to +95 °C (it still seems to be alive).
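
For anyone hitting the same symptom, here is a minimal sketch of logging the GPU temperature from inside the training script instead of a separate monitoring tool (assuming the nvidia-ml-py / pynvml package is installed; gpu_temperature_c is just an illustrative helper name):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

def gpu_temperature_c():
    # current core temperature in degrees Celsius, as reported by NVML
    return pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

# e.g. once per epoch inside the training loop:
# print(f"GPU temperature: {gpu_temperature_c()} C")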