PyTorch TRAINING A CLASSIFIER tutorial error during CUDA run

I am trying to run the PyTorch TRAINING A CLASSIFIER tutorial code with CUDA: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py The code is below. It runs fine on the CPU, but the GPU/CUDA run fails with the following error:

"Process finished with exit code -1073741819 (0xC0000005) The error happens during execution of the loss.backward()

I would be very grateful for a solution.

Thanks

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.optim as optim

# Let’s first define our device as the first visible cuda device if we have CUDA available:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

# Assuming that we are on a CUDA machine, this should print a CUDA device:

print(device)
# The output of torchvision datasets are PILImage images of range [0, 1].
# We transform them to Tensors of normalized range [-1, 1].
# Note: if running on Windows and you get a BrokenPipeError, try setting
# the num_workers argument of torch.utils.data.DataLoader() to 0.

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Use CIFAR10
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=0)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=0)

classes = tuple(trainset.classes)
# https://www.w3schools.com/python/python_tuples.asp
# https://www.geeksforgeeks.org/python-convert-a-list-into-a-tuple/
# classes = ('plane', 'car', 'bird', 'cat',
#            'deer', 'dog', 'frog', 'horse', 'ship', 'truck')



# functions to show an image


def imshow(img):
    img = img / 2 + 0.5     # unnormalize: undo the Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) transform
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)  # use the built-in next(); newer PyTorch DataLoader iterators have no .next() method

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))  # batch size is 4, therefore we print 4 labels

# Define a Convolutional Neural Network

# Copy the neural network from the Neural Networks section before and modify it to take 3-channel images (instead of 1-channel images as it was defined).
# Important - this is the way to upgrade my BBOB net.

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
net.to(device)


# Define a Loss function and optimizer
# Let’s use a Classification Cross-Entropy loss and SGD with momentum.


criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Train the network
# This is when things start to get interesting. We simply have to loop over our data iterator, and feed the inputs to the network and optimize.
# zero the parameter gradients
# optimizer.zero_grad()

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        # inputs, labels = data[0].to(device), data[1].to(device)
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

# Let’s quickly save our trained model:

PATH = './cifar_net.pth'
torch.save(net.state_dict(), PATH)

# Test the network on the test data
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

# Next, let’s load back in our saved model (note: saving and re-loading the model wasn’t necessary here, we only did it to illustrate how to do so):

net = Net()
net.load_state_dict(torch.load(PATH))

# Okay, now let us see what the neural network thinks these examples above are:
outputs = net(images)

# The outputs are energies for the 10 classes. The higher the energy for a class, the more the network thinks that the image is of the particular class. So, let’s get the index of the highest energy:

_, predicted = torch.max(outputs, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

# Hmmm, what are the classes that performed well, and the classes that did not perform well:

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))

This error seems to be caused by a variety of issues, which apparently point towards a broken installation, as seen e.g. here.

Could you create a new virtual environment, reinstall PyTorch there, and rerun the code, please?
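If it helps to narrow things down, here is a minimal CUDA smoke test (assuming a fresh environment) that is independent of the tutorial code; if this also dies with 0xC0000005, the problem is in the installation or driver, not your script:

import torch

print(torch.__version__)
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())

# tiny forward/backward pass on the GPU
x = torch.randn(8, 4, device="cuda", requires_grad=True)
w = torch.randn(4, 2, device="cuda", requires_grad=True)
loss = (x @ w).sum()
loss.backward()
print(loss.item(), w.grad.shape)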

Looks like criterion is not moved to GPU.
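If that is the suspicion, moving it is a one-liner. A minimal sketch (note that nn.CrossEntropyLoss holds no parameters unless you pass a weight tensor, so this should be a no-op here and is unlikely to explain a hard crash, but it does not hurt):

criterion = nn.CrossEntropyLoss().to(device)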

Hi. I did that, but nothing helps. The same code runs fine on another computer. Any ideas?

Thanks. Tried it and it didn't help. Any ideas?

Can anybody help, please? It is still not working.

Try to use a Docker container and rerun the code there. If the error is still raised, there might be some issue in your system directly. If the Docker container works, try to reinstall "more" packages on your bare metal, such as the drivers, etc.

That is not a package error; the issue is that you failed to send the data to the device. Do this to solve it:

images, labels = next(dataiter)
images, labels = images.to(device), labels.to(device)  # add this line every time you use the data

You will see that if you try to view CUDA tensors as NumPy data you will get an error, so do NOT send to the device the data that will be used with NumPy! :smiley: (NumPy only works on the CPU.)
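If you do need to plot a batch that already lives on the GPU, one option (a minimal sketch) is to copy it back to the host first:

# .cpu() returns a CPU copy, so the .numpy() call inside imshow works again
imshow(torchvision.utils.make_grid(images.cpu()))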

dataiter = iter(testloader)
images, labels = next(dataiter)
images, labels = images.to(device), labels.to(device)  # sending the data to the device here will make...

# print images
imshow(torchvision.utils.make_grid(images))             # ...this fail: imshow calls .numpy(), which needs a CPU tensor
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

To make it all work, only send them to the device right before computing the output :wink::

inputs, labels = inputs.to(device), labels.to(device)
outputs = net(inputs)
loss = criterion(outputs, labels)

# and in the evaluation loop:

with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device) #here you go
        outputs = net(images)
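One last thing, assuming you also want the reloaded model to run on the GPU: net = Net() constructs a fresh model on the CPU, so it has to be moved as well before you feed it CUDA tensors:

net = Net()
net.load_state_dict(torch.load(PATH))
net.to(device)   # the freshly constructed model starts on the CPU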