I need help: PyTorch reports that CUDA is in use, but my GPU shows 0% utilization!

Hello everyone. I'm doing a classic image classification exercise in PyTorch, using a CNN with dropout and running it on my GPU through CUDA. When I first ran my training loop I got a very good execution time of ~30 s/epoch. Two hours later I reran the code, and the same training loop took ~10 min/epoch. I went back to the unmodified base code (the version that gave ~30 s/epoch just two hours earlier), but I still get ~10 min/epoch.

I reinstalled my NVIDIA drivers with CUDA, as well as the torch, torchvision and cudatoolkit libraries, but nothing solved the problem. A friend who has the same PC as me tried the code and gets ~30 s/epoch.

For your information, I am on Windows 10 (without using Anaconda). And to check the use of my GPU, I look directly in the task manager of my Windows 10, which displays the percentage of GPU usage of each program.

I added a few print statements to show that CUDA works:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)
print(torch.cuda.get_device_name())
print(torch.__version__)
print(torch.version.cuda)
x = torch.randn(1).cuda()
print(x)

output :
cuda
NVIDIA GeForce GTX 1060 3GB
1.10.2+cu113
11.3
tensor([-0.6228], device='cuda:0')

and my very simple learning loop :

import time

import torch.nn.functional as F
from tqdm import tqdm

learning_rate = 0.01
momentum = 0.5
batch_size_train = 40
batch_size_test = 500

data = loadImgs(batch_size_train=batch_size_train, batch_size_test=batch_size_test)
model = Net().to(device)
# the optimizer must receive the parameters of the model that is actually trained
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

num_epochs = 20

for j in range(num_epochs):
    batch_enum = enumerate(data.loader_train)
    i_count = 0
    iterations = data.num_train_samples // data.batch_size_train
    loss_list = []
    for batch_idx, (dt, targets) in tqdm(batch_enum):
        i_count = i_count+1
        outputs = model(dt.to(device))
        loss = F.cross_entropy(outputs, targets.to(device))
        loss_list.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i_count == iterations:
            break

    # save the weights of the trained model under a timestamped name
    torch.save(model.state_dict(), './data/model_TP'+str(time.time())+'.pt')

You said that you had a successful start utilizing the GPU, and then it shifted to the CPU. Can you get it started running on the GPU again, or is it since that moment just stuck on the CPU instead?

When you did that CUDA reinstall, did you install drivers compatible with your Torch version? I made the mistake of installing the latest CUDA drivers before, and had to downgrade since PyTorch wasn’t up-to-date with those.

Yes, both are under version 11.

In fact, my whole neural network is on CUDA, so it should be running on the GPU. But when I run my code, execution is really slow and Task Manager shows GPU usage at ~1-4%, whereas this morning, with the same unchanged code, my GPU was used at 100% (as far as I know, CUDA offers no way to cap GPU usage at a given percentage).

More concretely, I have the impression that my code does use CUDA and the GPU, but runs in slow motion. I just ran a test with:

torch.device("cpu")

and it runs at the same speed as:

torch.device("cuda")

The only difference is that with the CPU version, Task Manager shows about 40% processor usage, while with CUDA there is almost no CPU usage and almost no GPU usage either. I wonder whether the problem is the amount of memory allocated to the GPU, which is very low, but I don't know how to set it. I have also already emptied the CUDA cache several times.
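One way to narrow this down is to time data loading and the training step separately, since a GPU that waits for input shows low utilization. The sketch below is only an illustration: `profile_loop`, `step_fn` and `sync_fn` are hypothetical names, not part of the code in this thread. With PyTorch one would presumably pass `torch.cuda.synchronize` as `sync_fn`, because CUDA kernels run asynchronously and the host clock would otherwise stop too early.

```python
import time


def profile_loop(loader, step_fn, sync_fn=None, max_batches=50):
    """Return (seconds waiting for data, seconds in step_fn, batches timed).

    loader  -- any iterable that yields batches (e.g. a DataLoader)
    step_fn -- callable that runs one training step on a batch
    sync_fn -- optional callable that blocks until queued device work is
               finished (assumption: torch.cuda.synchronize on a GPU)
    """
    load_time = 0.0
    step_time = 0.0
    n = 0
    it = iter(loader)
    while n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent producing the batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)            # forward / backward / optimizer step
        if sync_fn is not None:
            sync_fn()             # wait for asynchronous device work
        t2 = time.perf_counter()
        load_time += t1 - t0
        step_time += t2 - t1
        n += 1
    return load_time, step_time, n


def slow_loader(num_batches=5, delay=0.02):
    """Stand-in loader that pretends to decode images from disk."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


# Stand-in workload: a slow "loader" and a step that does nothing.
load_s, step_s, n = profile_loop(slow_loader(), lambda batch: None)
print(f"{n} batches: {load_s:.3f}s loading, {step_s:.3f}s in the step")
```

If most of the time lands in the loading column, the GPU is probably idle while it waits for batches. That is a possibility to check, not a diagnosis of this particular machine.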

Also, this problem appeared right after I tried to display a very large list in my Jupyter notebook, which made the notebook crash. I don't know whether the two are related, but I have since reinstalled Python and all the libraries.

Hm… I would recommend visiting the NVIDIA CUDA Toolkit Archive (on the NVIDIA Developer site) and installing the latest CUDA version that is compatible with your PyTorch version.

Personally, I have never gotten a mismatched CUDA version to work properly with my PyTorch installations. NVIDIA releases CUDA drivers faster than PyTorch and other projects have time to adapt to them.
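The version string printed earlier, `1.10.2+cu113`, already names the CUDA build the installed wheel targets, so it can be compared against `torch.version.cuda` before reinstalling anything. A minimal sketch of that comparison follows. `cuda_from_wheel_tag` is a hypothetical helper, and the assumption that the tag is the major version followed by a single minor digit holds for tags such as `cu102` and `cu113` but is not a documented guarantee.

```python
import re


def cuda_from_wheel_tag(torch_version):
    """Extract a CUDA version like '11.3' from a string like '1.10.2+cu113'.

    Returns None for builds without a '+cuXYZ' local tag (e.g. CPU-only).
    Assumption: the last digit of the tag is the minor version.
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{int(digits[:-1])}.{digits[-1]}"


# Values copied from the output shown earlier in the thread.
reported_version = "1.10.2+cu113"   # torch.__version__
reported_cuda = "11.3"              # torch.version.cuda

tag_cuda = cuda_from_wheel_tag(reported_version)
verdict = "matches" if tag_cuda == reported_cuda else "differs from"
print(tag_cuda, verdict, reported_cuda)
```

For the values posted above, the tag and `torch.version.cuda` agree, which suggests the install itself is consistent.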

I just did a new test and in fact my cuda/gpu works as shown on the picture:

But I don't know why my code here is so slow. When I run it, the training loop runs at about 7 it/s on the GPU and 3-4 it/s on the CPU, but two days ago the same code reached 35 it/s. I tested it on a friend's computer (with a very similar graphics card) and it also reaches 35 it/s there. I checked my NVIDIA CUDA driver and my CUDA toolkit version, and they look fine, so I don't understand why this code is slow only on my computer.

Here is my code:

If you also want the data to test everything, it is available here (note that I did not write this code; my teacher did):
https://filesender.renater.fr/?s=download&token=517e4ce7-3316-4774-b622-4ee49e85ff39

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
# from torch.autograd import Variable
import torchvision.transforms as transforms
import torchvision.datasets as dset
import torchvision.utils as vutils
from PIL import ImageFile
# import os

from tqdm import tqdm

learning_rate = 0.01
momentum = 0.5
batch_size_train = 40
batch_size_test = 500

# Dataloader class and function

ImageFile.LOAD_TRUNCATED_IMAGES = True


class Data:
    def __init__(self, dataset_train, dataset_train_original, dataloader_train,
                 dataset_test, dataset_test_original, dataloader_test,
                 batch_size_train, batch_size_test):
        self.train = dataset_train
        self.train_original = dataset_train_original
        self.loader_train = dataloader_train
        self.num_train_samples = len(dataset_train)
        self.test = dataset_test
        self.test_original = dataset_test_original
        self.loader_test = dataloader_test
        self.num_test_samples = len(dataset_test)
        self.batch_size_train = batch_size_train
        self.batch_size_test = batch_size_test


def loadImgs(des_dir="./data/", img_size=100, batch_size_train=40, batch_size_test=100):

    dataset_train = dset.ImageFolder(root=des_dir + "train/",
                               transform=transforms.Compose([
                                   transforms.Resize(img_size),
                                   transforms.RandomCrop(75, padding=4),
                                   transforms.RandomHorizontalFlip(),
                                   transforms.ToTensor(),
                                   transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
                               ]))

    dataset_train_original = dset.ImageFolder(root=des_dir + "train/",
                               transform=transforms.Compose([
                                   transforms.Resize(img_size),
                                   transforms.ToTensor(),
                               ]))

    dataset_test = dset.ImageFolder(root=des_dir + "test/",
                               transform=transforms.Compose([
                                   transforms.Resize(img_size),
                                   transforms.RandomCrop(75, padding=4),
                                   transforms.RandomHorizontalFlip(),
                                   transforms.ToTensor(),
                                   transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
                               ]))

    dataset_test_original = dset.ImageFolder(root=des_dir + "test/",
                               transform=transforms.Compose([
                                   transforms.Resize(img_size),
                                   transforms.ToTensor(),
                               ]))

    dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=batch_size_train, shuffle=True)

    dataloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=batch_size_test, shuffle=True)

    data = Data(dataset_train, dataset_train_original, dataloader_train,
                dataset_test, dataset_test_original, dataloader_test,
                batch_size_train, batch_size_test)
    return data

# evaluation on a batch of test data:
def evaluate(model, data):
    # only the first (shuffled) test batch is evaluated
    testdata, testtargets = next(iter(data.loader_test))
    testdata = testdata.to(device)
    testtargets = testtargets.to(device)
    model.eval()
    with torch.no_grad():  # gradients are not needed for evaluation
        oupt = torch.argmax(model(testdata), dim=1)
    t = torch.sum(oupt == testtargets)
    result = t * 100.0 / len(testtargets)
    model.train()
    print(f"{t.item()} correct on {len(testtargets)} ({result.item()} %)")
    return result.item()

# iteratively train on batches for one epoch:
def train_epoch(model, optimizer, data):
    batch_enum = enumerate(data.loader_train)
    i_count = 0
    iterations = data.num_train_samples // data.batch_size_train
    for batch_idx, (dt, targets) in tqdm(batch_enum):
        i_count = i_count+1
        outputs = model(dt.to(device))
        loss = F.cross_entropy(outputs, targets.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i_count == iterations:
            break

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.25)
        self.dropout3 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(((((75-2)//2-2)//2)**2)*64, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.conv1(x.view(-1, 3, 75, 75)))
        x = self.dropout1(F.max_pool2d(x, 2))
        x = F.relu(self.conv2(x))
        x = self.dropout2(F.max_pool2d(x, 2))
        x = torch.flatten(x, 1)
        x = self.dropout3(F.relu(self.fc1(x)))
        x = self.fc2(x)
        x = self.fc3(x)
        return x

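# select the GPU when available; train_epoch() and evaluate() read this global
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

data = loadImgs(batch_size_train=batch_size_train, batch_size_test=batch_size_test)

net = Net().to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate, momentum=momentum)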

# net.load_state_dict(torch.load('./data/model_TP.pt'))
# evaluate(net, data)

num_epochs = 1
for j in range(num_epochs):
    print(f"epoch {j} / {num_epochs}")
    train_epoch(net, optimizer, data)
    evaluate(net, data)
    torch.save(net.state_dict(), './data/model_TP.pt')

As I understand it, GPU speed depends on many things:

0. Batch size

If the batch size is too small, more time is spent on data transfer than on useful work on the GPU.

1. The temperature of the GPU

If the GPU gets too hot, hardware/software throttling kicks in and reduces its speed.

2. The hard drive speed (local drive or network drive)

It matters whether you are loading from a local SATA/SSD drive or whether the data sits on a network drive.

3. The processor speed (and the time it takes to populate the cache)

In my experience, it sometimes takes a while for the processor to reach full utilization and populate its caches with data, especially after a system reboot. So wait a while and see whether the speed improves.

Check whether any of these helps.
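The temperature and throttling point can be checked from a script by querying `nvidia-smi`, which is installed with the NVIDIA driver. The sketch below is an illustration under assumptions, not a diagnosis tool: `parse_gpu_csv` and `query_gpu` are hypothetical helper names, and the sample line in the code is made up, not measured on the GTX 1060 from this thread.

```python
import shutil
import subprocess

# Query fields passed to nvidia-smi's --query-gpu option.
FIELDS = ["utilization.gpu", "temperature.gpu", "clocks.sm", "memory.used"]


def _to_float(text):
    """Convert a field to float, or None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(line):
    """Turn one 'csv,noheader,nounits' line into a dict keyed by FIELDS."""
    values = [v.strip() for v in line.split(",")]
    return {name: _to_float(value) for name, value in zip(FIELDS, values)}


def query_gpu():
    """Return readings for the first GPU, or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return parse_gpu_csv(result.stdout.splitlines()[0])


# Made-up sample line: 3 % utilization, 81 C, 1506 MHz SM clock, 412 MiB used.
sample = "3, 81, 1506, 412"
print(parse_gpu_csv(sample))
print(query_gpu())  # None when nvidia-smi cannot be found or fails
```

Calling `query_gpu()` every few seconds while training runs would show whether the temperature climbs while the SM clock drops, which would be consistent with throttling.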
