CUDA Error: no kernel image is available for execution on the device

MaxM · April 29, 2021, 11:09am

Hello everyone,

I am quite new to neural networks.
I've built a very simple CNN with the help of YouTube tutorials, and now I have a problem running my code on the GPU with CUDA. If I set the device to cpu, my code works perfectly well, but if I try to run it on the GPU I get the error shown below. The GPU is an Nvidia Tesla K20m, and torch.cuda.is_available() returns True. The code runs on Python 3.9, and the operating system is Red Hat Enterprise Linux (RHEL) / CentOS 7. My input images are 144 x 144 with one channel, so the images tensor that goes into my model has shape torch.Size([32, 1, 144, 144]). Do you know what the problem could be?

Python Script:

import torch
import torch.nn as nn # All neural network modules, nn.Linear, nn.Conv2d, BatchNorm, Loss functions
import torchvision.transforms as transforms # Transformations we can perform on our dataset
import torchvision
import torch.nn.functional as F
from torch.utils.data import (Dataset, DataLoader) # Gives easier dataset management and creates mini batches
import matplotlib.pyplot as plt
import pandas as pd
from skimage import io
import numpy as np
import os

# Class for custom dataset

class SurfaceDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        # Column 0 holds the file name, column 1 the class label
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        image = io.imread(img_path)
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))

        if self.transform:
            image = self.transform(image)

        return (image, y_label)

# Set device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print information about the usage of cuda

if torch.cuda.is_available():
    print("CUDA is available")
    print(f"Number of available GPU is {torch.cuda.device_count()}")
else:
    print("CUDA isn't available")

# Hyperparameters

num_classes = 2
learning_rate = 1e-3
batch_size = 32 # Batch sizes are usually a power of two (1, 2, 4, 8, 16, ...)
num_epochs = 2

print("-------------- Hyperparameter Settings --------------")
print(f"Number of classes: {num_classes}")
print(f"Learning rate: {learning_rate}")
print(f"Batch-size: {batch_size}")
print(f"Number of epochs: {num_epochs}")

# Load Data

dataset = SurfaceDataset(
    csv_file="Klassifizierung.csv",
    root_dir="Bilder",
    transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]))

classes = ('Defekte Oberfläche', 'Defektfreie Oberfläche')

train_set, test_set = torch.utils.data.random_split(dataset, [800, 100]) # Set the ratio of train and test images
train_loader = DataLoader(dataset=train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_set, batch_size=batch_size, shuffle=False)

def imshow(img):
    img = img / 2 + 0.5 # Undo the normalization
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show() # Display all open figures.

# Get some random training images

dataiter = iter(train_loader)
images, labels = next(dataiter)

# Show images

imshow(torchvision.utils.make_grid(images)) # make_grid combines several pictures into one picture | make_grid also converts from 1 channel to 3 channels

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 33 * 33, 120)
        self.fc2 = nn.Linear(120, 30)
        self.fc3 = nn.Linear(30, 2)

    def forward(self, x):
        # x -> batch_size, input_channels, height of the image, width of the image | batch_size, 1, 144, 144
        x = self.pool(F.relu(self.conv1(x)))  # -> batch_size, 6, 70, 70
        x = self.pool(F.relu(self.conv2(x)))  # -> batch_size, 16, 33, 33
        x = x.view(-1, 16 * 33 * 33)  # -> batch_size, 16 * 33 * 33
        x = F.relu(self.fc1(x))  # -> batch_size, 120
        x = F.relu(self.fc2(x))  # -> batch_size, 30
        x = self.fc3(x)  # -> batch_size, num_classes
        return x

model = ConvNet().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) # import torch.optim for all Optimization algorithms, SGD, Adam, etc.

n_total_steps = len(train_loader) # Number of batches per epoch in the train_loader
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):

        images = images.to(device)  # torch.Size([batch_size, 1, 144, 144])
        labels = labels.to(device)  # torch.Size([batch_size])

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i + 1) % 1 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{n_total_steps}], Loss: {loss.item():.4f}')

# Save the trained weights of the neural network

print('Finished Training')
PATH = './cnn.pth'
torch.save(model.state_dict(), PATH)

with torch.no_grad():
    n_correct = 0
    n_samples = 0
    n_class_correct = [0 for i in range(num_classes)]
    n_class_samples = [0 for i in range(num_classes)]
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1) # torch.max returns (value, index)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()

        for i in range(len(labels)):
            label = labels[i].item() # Plain Python int, usable as a list index
            pred = predicted[i].item()
            if (label == pred):
                n_class_correct[label] += 1
            n_class_samples[label] += 1

    acc = 100.0 * n_correct / n_samples
    print(f'Accuracy of the network: {acc} %')

    for i in range(num_classes):
        acc = 100.0 * n_class_correct[i] / n_class_samples[i]
        print(f'Accuracy of {classes[i]}: {acc} %')

Conda list:

Error message:

Traceback (most recent call last):
  File "/utmnt/ut/ft2/cql7772/Test2/OberflaechenKlassifizierung.py", line 105, in <module>
    outputs = model(images)
  File "/fibus/fs1/16/cql7772/.conda/envs/CNN-Klassifizierung/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/utmnt/ut/ft2/cql7772/Test2/OberflaechenKlassifizierung.py", line 84, in forward
    x = self.pool(F.relu(self.conv1(x))) # -> batch_size, 6, 70, 70
  File "/fibus/fs1/16/cql7772/.conda/envs/CNN-Klassifizierung/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/fibus/fs1/16/cql7772/.conda/envs/CNN-Klassifizierung/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/fibus/fs1/16/cql7772/.conda/envs/CNN-Klassifizierung/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device

ptrblck · April 30, 2021, 12:17am

This error is raised when you try to execute CUDA code that was not compiled for the compute capability of the device you are using.
The Tesla K20 has a compute capability of 3.5, which is no longer shipped in the prebuilt binaries.
You could build PyTorch from source as described here.
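
To see the mismatch on a given machine, the device's compute capability can be compared with the architectures the installed binary was compiled for. The following is a minimal sketch, not an official check: `parse_arch`, `has_kernel_image`, `describe_installed_build` and the `ILLUSTRATIVE_ARCH_LIST` values are hypothetical names and data made up for illustration, and the matching rule (an `sm_XY` image runs on devices with the same major version and an equal or higher minor version, a `compute_XY` PTX image can be compiled at load time for equal or newer devices) is a simplification. The only PyTorch calls used are `torch.cuda.get_device_capability`, `torch.cuda.get_arch_list` and `torch.cuda.get_device_name`.

```python
import re


def parse_arch(tag):
    """Turn a tag such as 'sm_37' or 'compute_75' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", tag)
    if match is None:
        return None  # Unknown tag format, ignored by the caller
    return match.group(1), int(match.group(2)), int(match.group(3))


def has_kernel_image(capability, arch_list):
    """Simplified rule: is there a usable kernel image for this capability?"""
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, arch_major, arch_minor = parsed
        # Binary images: same major version, minor not newer than the device
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX images: any architecture not newer than the device
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def describe_installed_build(device_index=0):
    """Query the local PyTorch install (requires torch and a visible GPU)."""
    import torch  # Imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return "CUDA is not available"
    capability = torch.cuda.get_device_capability(device_index)
    arch_list = torch.cuda.get_arch_list()
    usable = has_kernel_image(capability, arch_list)
    name = torch.cuda.get_device_name(device_index)
    return f"{name}: capability {capability}, build targets {arch_list}, usable: {usable}"


# Illustrative values only (assumption), not read from a real install
ILLUSTRATIVE_ARCH_LIST = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(has_kernel_image((3, 5), ILLUSTRATIVE_ARCH_LIST))  # capability 3.5, e.g. the K20
print(has_kernel_image((3, 7), ILLUSTRATIVE_ARCH_LIST))  # capability 3.7
```

Calling `print(describe_installed_build())` on the affected machine shows the real values for that installation.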

MaxM · April 30, 2021, 6:46am

Thanks for your answer. Unfortunately, that's a pity. But honestly I am quite confused: I looked up which CUDA toolkit the Tesla K20m supports, and it showed support up to 10.2. As you can see in my list of packages, I have installed cudatoolkit 10.2.89 hfd86e86_1. So where exactly is the problem? Is it Python 3.9? Sorry for all the questions. I am trying to understand the root cause of the problem.


ptrblck · April 30, 2021, 6:50am

The issue is not caused by CUDA or Python, but by the size of the pip wheels and conda binaries.
Since the pip wheels have an especially hard size limit, support for older compute capabilities is dropped after some time, which was the case for 3.5. The current CUDA 10.2 binaries support compute capabilities 3.7-7.5, and the CUDA 11.1 binaries support 3.7-8.6.

EDIT: you might also want to check this issue where some users were building binaries for sm_35.
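
The ranges quoted above can be written down as a small lookup to see which prebuilt binary covers which card. This is only a sketch under assumptions: `BINARY_RANGES` and `builds_covering` are hypothetical names, the ranges are copied from this post as they stood at the time of writing, and treating each range as continuous is a simplification.

```python
# Compute capability ranges as quoted in this thread (may be outdated)
BINARY_RANGES = {
    "CUDA 10.2": ((3, 7), (7, 5)),
    "CUDA 11.1": ((3, 7), (8, 6)),
}


def builds_covering(capability):
    """Return the prebuilt binaries whose quoted range includes capability."""
    return [
        name
        for name, (lowest, highest) in BINARY_RANGES.items()
        if lowest <= capability <= highest  # Tuples compare major first, then minor
    ]


# Capabilities of the two cards discussed in this thread
gpus = {"Tesla K20m": (3, 5), "Tesla K80": (3, 7)}
for gpu, capability in gpus.items():
    print(gpu, capability, builds_covering(capability) or "build from source")
```

With these numbers the K20m falls below both ranges, while the K80 sits exactly on the lower bound.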

MaxM · April 30, 2021, 10:25am

Many thanks for your answer. Now it's a lot clearer to me. I also have the possibility to run my code on a Tesla K80 GPU. Perhaps the error will disappear there, since the Tesla K80 has a compute capability of 3.7, which should be fine for running my code.
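
A script can also guard against this situation by running a tiny convolution on each candidate device and falling back to the CPU if none of them works. The sketch below is one possible approach, not something taken from this thread: `pick_device`, `torch_conv_probe` and `available_cuda_names` are hypothetical helpers, and it assumes that the failure surfaces as a `RuntimeError`, as in the traceback above.

```python
def pick_device(candidates, probe):
    """Return the first device name for which probe succeeds, else 'cpu'."""
    for name in candidates:
        try:
            probe(name)
        except RuntimeError:
            continue  # e.g. "no kernel image is available", try the next device
        return name
    return "cpu"


def torch_conv_probe(name):
    """Run a very small convolution on the given device (requires torch)."""
    import torch  # Imported lazily so pick_device can be used without torch
    import torch.nn.functional as F

    device = torch.device(name)
    x = torch.zeros(1, 1, 8, 8, device=device)
    w = torch.zeros(1, 1, 3, 3, device=device)
    F.conv2d(x, w)
    torch.cuda.synchronize(device)  # Wait so that asynchronous errors show up here


def available_cuda_names():
    """List the visible CUDA devices as 'cuda:0', 'cuda:1', ..."""
    import torch

    return [f"cuda:{i}" for i in range(torch.cuda.device_count())]
```

In the script above, the device line could then become `device = torch.device(pick_device(available_cuda_names(), torch_conv_probe))`. Passing the probe as an argument keeps `pick_device` independent of PyTorch, so it can be exercised with a fake probe.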

MaxM · May 1, 2021, 9:41pm

My Python script worked perfectly well on the K80 card. Many thanks again!