Algorithm doesn't run on GPU even after moving the model and data to the GPU. What am I missing?

You can find the training section of my code below:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Note: torch.cuda.is_available() returns True when I check it.

After creating the CNN model I wrote:

model = model.to(device)

Training Section:

import time
start_time = time.time()

epochs = 3

# Optional: limit the number of batches to train faster
max_trn_batch = 800 # 800 batches of 10 images --> 8000 images total
max_tst_batch = 300 # 300 batches of 10 images --> 3000 images total

train_losses = []
test_losses = []
train_correct = []
test_correct = []

for i in range(epochs):

    trn_corr = 0
    tst_corr = 0

    for b,(X_train,y_train) in enumerate(train_loader):
        X_train,y_train = X_train.to(device),y_train.to(device)

        # optional: limit the number of batches
        if b == max_trn_batch:
            break
        b = b + 1

        y_pred = model(X_train)
        loss = criterion(y_pred,y_train)

        predicted = torch.max(y_pred.data,1)[1]
        batch_corr = (predicted == y_train).sum()
        trn_corr = trn_corr + batch_corr

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if b%200 == 0:
            print('Epoch:  {} Loss:  {} Accuracy:  {}'.format(i,loss,trn_corr.item()*100/(10*b)))

    train_losses.append(loss)
    train_correct.append(trn_corr)

    # test set
    with torch.no_grad():
        for b,(X_test,y_test) in enumerate(test_loader):
            X_test,y_test = X_test.to(device),y_test.to(device)

            # optional: limit the number of batches
            if b==max_tst_batch:
                break
            y_val = model(X_test)
            predicted = torch.max(y_val.data,1)[1]
            batch_corr = (predicted == y_test).sum()
            tst_corr = tst_corr + batch_corr

    loss = criterion(y_val,y_test)
    test_losses.append(loss)
    test_correct.append(tst_corr)

total_time = time.time() - start_time
print(f'Total Time: {total_time/60} minutes')

During training I monitored CPU and GPU usage: the CPU runs at 100% while the GPU stays at 1%.

Note 1: The algorithm took 13 minutes with the CPU as the device and 7 minutes with the GPU as the device, so there is some improvement, but I couldn't see any GPU utilization in Task Manager during training.

Note 2: Parameters

ConvolutionalNetwork(
  (conv1): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=46656, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=2, bias=True)
)

Thanks in advance

What are your pytorch and cuda versions?

I think you're not assigning your CUDA device properly. A CUDA device is usually assigned like this:
cuda:<device number> (usually 0). You can get the device number by running this command:
torch.cuda.current_device()
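
As a rough illustration of the suggestion above, here is a minimal sketch of building the device with an explicit index. The small nn.Linear model is a hypothetical stand-in for the real CNN, and the CPU fallback is only there so the snippet also runs on a machine without a GPU.

```python
import torch
import torch.nn as nn

# Build the device string from the current CUDA device index when a GPU is
# available; otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device(f"cuda:{torch.cuda.current_device()}")
else:
    device = torch.device("cpu")

# Hypothetical stand-in for the real model.
model = nn.Linear(4, 2).to(device)

# The parameters should now live on the selected device.
param_device = next(model.parameters()).device
print(device, param_device)
```

Note that torch.device('cuda') without an index refers to the current CUDA device, so on a single-GPU machine both forms should place the model on the same GPU.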

PyTorch version is 1.6
CUDA version is 10.2.89

I assigned it as you suggested, but unfortunately observed no difference.

Try this:

I still can't be sure whether I am really using the GPU. I searched online and found a suggestion to use nvidia-smi to check utilization during training. GPU utilization is 1 percent. This is my full code btw:

You should get an error if you try to push tensors or the model to the GPU and no GPU is available.
Also, once the data is transferred, you can check the device via print(X_test.device) and make sure it's cuda:id.
Your GPU utilization might be that low because your model is really small, so the run time is dominated by overhead such as kernel launches, data loading, and processing.
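
The device check described above could look something like the sketch below. The Conv2d layer and the random batch of 10 images at 32x32 are hypothetical stand-ins for the real model and data; only the .device checks are the point.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the real model and one batch of 10 images.
model = nn.Conv2d(3, 6, kernel_size=3).to(device)
X_test = torch.randn(10, 3, 32, 32).to(device)

# Collect the device of every parameter; a model that was moved correctly
# reports exactly one device.
param_devices = {p.device for p in model.parameters()}
print("parameters on:", param_devices)
print("batch on:", X_test.device)

# The output of the forward pass lives on the same device as its inputs.
y_val = model(X_test)
print("output on:", y_val.device)
```

If everything was moved correctly, all three lines should report the same device, e.g. cuda:0 on a single-GPU machine.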

Thank you very much for the answer. I increased the batch size to check whether that's the case and observed that utilization sometimes increased from 1% to 3%. So it really was about the size of my data.
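
For anyone who wants to repeat this experiment, here is a rough sketch of measuring throughput at a few batch sizes. The small CNN, the 32x32 input size, and the batch sizes 10, 64, and 256 are all assumptions chosen for illustration, not the values from the code above. torch.cuda.synchronize() is called before reading the clock because CUDA kernels run asynchronously.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small CNN, used only to have something to time.
model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(6 * 15 * 15, 2),
).to(device)

def images_per_second(batch_size, repeats=10):
    """Time forward passes on random data and return images per second."""
    X = torch.randn(batch_size, 3, 32, 32, device=device)
    with torch.no_grad():
        model(X)  # warm-up pass, not timed
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for pending GPU work
        start = time.perf_counter()
        for _ in range(repeats):
            model(X)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return batch_size * repeats / elapsed

throughput = {bs: images_per_second(bs) for bs in (10, 64, 256)}
for bs, ips in throughput.items():
    print(f"batch size {bs:>3}: {ips:,.0f} images/s")
```

The numbers depend heavily on the hardware, so treat them as a relative comparison between batch sizes rather than a benchmark.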