CPU faster than GPU?

I am running PyTorch on a machine with a GPU.
I am observing that it actually runs faster on the CPU than on the GPU: about 30 seconds on the CPU versus 54 seconds on the GPU.
Is that possible?
There are some steps where I convert tensors with cuda(); could that slow it down?

Could it be a problem with the computer? It is a cloud computing service.

It is hard to share my code, as it is rather long and somewhat proprietary.


Could you explain your use case a bit?
If your workload isn’t that big (e.g. a small model), this might be the case.
Are you using small batches with very little computation?


This can happen when the cost of transferring data between RAM and GPU memory outweighs the speedup from parallel computation on the GPU.

One case, as mentioned in @ptrblck’s answer, is when your model is quite small. Another is when your forward() function performs too many back-and-forth data transfers between the CPU and the GPU. See https://pytorch.org/docs/stable/notes/cuda.html for an example of when data is transferred between RAM and GPU memory.
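For illustration only (this toy module is made up, not code from this thread): a forward() that hops back to the CPU in the middle of the computation forces a device synchronization and two extra copies on every call, which can easily dominate the runtime of a small model.

```python
import torch
import torch.nn as nn

class ChattyModel(nn.Module):
    # Hypothetical anti-pattern: extra device hops inside forward()
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.fc1(x)
        x = x.cpu()                # GPU -> CPU copy, synchronizes the stream
        x = torch.clamp(x, min=0)  # some CPU-side processing
        x = x.cuda()               # CPU -> GPU copy to continue on the GPU
        return self.fc2(x)
```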


It’s like a five-layer convolutional network with 64 elements per layer, and I use a minibatch size of 1000 with an input vector size of 64.
That might qualify as a small model.

By 64 elements you mean 64 out_channels/kernels?
Are you using a DataLoader? Do you have any additional transfers in your model as @samarth-robo asked?

Not using a dataloader.

class CNN(nn.Module):
    def __init__(self, NumBins=32):
        super(CNN, self).__init__()
        self.InNodes = int(NumBins) * 2
        self.MediumNode = self.InNodes * 2
        self.Lin1 = nn.Linear(self.InNodes, self.MediumNode)
        self.Lin2 = nn.Linear(self.MediumNode, self.MediumNode)
        self.Lin5 = nn.Linear(self.MediumNode, 2)

    def forward(self, input):
        Zoutput = self.Lin1(input)
        Zoutput = F.relu(Zoutput)
        Zoutput = self.Lin2(Zoutput)
        Zoutput = F.relu(Zoutput)
        Zoutput = self.Lin5(Zoutput)
        return Zoutput

The model itself is faster on my machine using the GPU:

import time
import torch

x = torch.randn(1000, 64)
model = CNN()

cpu_times = []

for epoch in range(100):
    t0 = time.perf_counter()
    output = model(x)
    t1 = time.perf_counter()
    cpu_times.append(t1 - t0)

device = 'cuda'
model = model.to(device)
x = x.to(device)
torch.cuda.synchronize()

gpu_times = []
for epoch in range(100):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    output = model(x)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    gpu_times.append(t1 - t0)

print('CPU {}, GPU {}'.format(
    torch.tensor(cpu_times).mean(),
    torch.tensor(gpu_times).mean()))
> CPU 0.0018446099711582065, GPU 0.0003588759864214808

Try to use a DataLoader with pin_memory=True if you are using the GPU, as this loads the next batches into page-locked (pinned) memory while the GPU is busy, which speeds up the host-to-device copies.
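A minimal sketch of that pattern, with made-up tensor sizes (TensorDataset is just one convenient way to wrap tensors; non_blocking=True only pays off when the source batch is in pinned memory):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = 'cuda'
dataset = TensorDataset(torch.randn(100000, 64), torch.randn(100000, 2))
loader = DataLoader(dataset, batch_size=1000, shuffle=True,
                    pin_memory=True, num_workers=2)

for data, target in loader:
    # These copies can overlap with GPU compute because the batches are pinned
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # output = model(data); loss = criterion(output, target); ...
```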

What numbers do you get running my script?

PS: I’ve formatted your code. You can add code blocks by wrapping them in three backticks (```). :wink:


Thank you for the answer.
There is another difference.
I have a generating function

```
aInput = f()
```

that generates the data, and then I have a conversion step,

```
aInput = aInput.cuda()
```

Could that slow it down?

Where are you calling this code?
Are you pushing all of your data onto the GPU at once, or just a batch?
Usually you push only the current batch onto the GPU inside the training loop.

BigBatchSize = 100000
miniBatchSize = 1000
BigInput, BigTruth = ComputeResults(BigBatchSize)
for s in range(100):
    xPermute = torch.randperm(BigBatchSize)
    indices = xPermute[0:miniBatchSize]
    sInput = BigInput[indices, :]
    sTruth = BigTruth[indices, :]
    if GPU:
        sInput = sInput.cuda()
        sTruth = sTruth.cuda()
    outGuess = aModel(sInput)

The randperm call is quite expensive, especially since you only need 1000 indices.
You could use torch.randint if duplicate indices are OK. Otherwise you could permute once outside the loop and then take a “sliding” window of mini-batch size over the permutation.
This affects both the CPU and the GPU code.
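A rough sketch of both alternatives, reusing the variable names from your snippet (treat it as illustrative rather than drop-in code):

```python
import torch

BigBatchSize = 100000
miniBatchSize = 1000

# Option 1: sample with replacement (duplicates possible, but very cheap)
indices = torch.randint(0, BigBatchSize, (miniBatchSize,))

# Option 2: permute once, then slide a window over the permutation
xPermute = torch.randperm(BigBatchSize)
for s in range(100):
    start = (s * miniBatchSize) % BigBatchSize
    indices = xPermute[start:start + miniBatchSize]
    # sInput = BigInput[indices, :]
    # sTruth = BigTruth[indices, :]
```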
Were you able to run my small timing script?

@mattinjersey, it seems to me that the difference between your code and @ptrblck’s code is that the latter only measures the time for computation on the GPU; it does not account for the data transfer time (the data transfer happens only once in @ptrblck’s code).

You, on the other hand, transfer data to the GPU at every iteration, and hence you are observing the additional time required for that.

Is it possible, as @ptrblck suggested, to write your ComputeResults() function as a dataset, so that a DataLoader can generate data batches in page-locked (pinned) memory (by passing pin_memory=True)? Once a batch is in pinned memory, you can also pass non_blocking=True to the cuda() calls so that the transfer happens asynchronously with computation.

These two steps should help amortize the data transfer costs that are slowing you down.
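A minimal sketch of that idea, assuming ComputeResults() returns an (input, target) tensor pair as in your snippet (the GeneratedDataset class and the sizes below are hypothetical):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class GeneratedDataset(Dataset):
    """Wraps a precomputed tensor pair so a DataLoader can batch and pin it."""
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# BigInput, BigTruth = ComputeResults(BigBatchSize)   # your generating function
BigInput, BigTruth = torch.randn(100000, 64), torch.randn(100000, 2)  # placeholder
loader = DataLoader(GeneratedDataset(BigInput, BigTruth),
                    batch_size=1000, shuffle=True, pin_memory=True)

for sInput, sTruth in loader:
    sInput = sInput.cuda(non_blocking=True)
    sTruth = sTruth.cuda(non_blocking=True)
    # outGuess = aModel(sInput)
```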


Yes, I confirm the timing results of @ptrblck.
I will try the dataset approach that you have described…
Thx! Matt

So, are there any profiling tools that can measure GPU performance in Python?

torch.utils.bottleneck might be suitable for your profiling.

Also…
GPU-Z can check GPU speed on Windows. On Linux, nvidia-smi -l will do the trick.

Hi, I have a similar issue. I am trying to train a simple CNN using my GPU (GTX 1070). Here is the code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd 
from torchvision import datasets, transforms
from torchvision.utils import make_grid
from torch.utils.data import DataLoader
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import time
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
## Create transform method 
transform = transforms.ToTensor()

#Download test and train data 
train_data = datasets.MNIST(root='C:\\Users\\aybars\\Google Drive\\pytorch',train=True,download=True,transform=transform)
test_data =datasets.MNIST(root='C:\\Users\\aybars\\Google Drive\\pytorch',train=False,download=True,transform=transform)
 
#create loader object for batch loading

train_loader = DataLoader(train_data,batch_size=10,shuffle=True,pin_memory=True)
test_loader =DataLoader(test_data,batch_size=10,shuffle=True,pin_memory=True)

#create conv layers 
conv1 = nn.Conv2d(1,6,3,1)
conv2 = nn.Conv2d(6,16,3,1)

class ConvolutionalNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1,6,3,1)
        self.conv2 = nn.Conv2d(6,16,3,1)
        self.fc1 = nn.Linear(5*5*16,120)
        self.fc2 = nn.Linear(120,84)
        self.fc3 = nn.Linear(84,10)
        
    def forward(self,X):
        X = F.relu(self.conv1(X))
        X = F.max_pool2d(X,2,2)
        X = F.relu(self.conv2(X))
        X = F.max_pool2d(X,2,2)
        X = X.view(-1,5*5*16)
        X = F.relu(self.fc1(X))
        X = F.relu(self.fc2(X))
        X = F.log_softmax(self.fc3(X),dim=1)
        return X

torch.manual_seed(42)
model = ConvolutionalNeuralNetwork()
model = model.to(device)
print (model)


criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=0.001)


epoch = 2
train_losses = []
test_losses= []
train_correct=[]
test_correct =[]
t = time.time()
for i in range(epoch):
    trn_correct = 0
    tst_correct = 0
    for b,(X_train,y_train) in enumerate(train_loader):
        X_train=X_train.to(device)
        y_train=y_train.to(device)
        y_pred = model.forward(X_train)
        loss = criterion(y_pred,y_train)
        trn_correct+=(torch.max(y_pred,1)[1]==y_train).sum()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if b%600==0:
            print(f"Epoch : {i} BATCH: {b} LOSS: {loss.item()}")

    train_losses.append(loss)
    train_correct.append(trn_correct)
    print(f'train_accuracy : {trn_correct.item()/len(train_data)}')
elapsed_time = time.time()-t
print(f'Duration : {elapsed_time}')

It takes roughly 150 seconds. However, when I change

device = torch.device("cuda:0" if use_cuda else "cpu") 

to

device = torch.device("cpu" if use_cuda else "cpu") 

and set the pin_memory parameters to False, without changing anything else, it takes 86 seconds.

If you increase the workload (e.g. more filters in the conv layers or more conv layers in general), you should see the benefits of using the GPU. As explained above, tiny workloads might suffer from the overheads of pushing the data to the device as well as the kernel launches.
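For example (arbitrary channel counts, just to make the point), widening the model gives the GPU substantially more parallel work per batch; increasing the batch size from 10 to something like 128 helps for the same reason. The variant below also returns raw logits, since nn.CrossEntropyLoss applies log-softmax internally:

```python
import torch.nn as nn
import torch.nn.functional as F

class WiderCNN(nn.Module):
    # Same layout as the model above, but with more filters per conv layer
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 3, 1)
        self.conv2 = nn.Conv2d(64, 128, 3, 1)
        self.fc1 = nn.Linear(5 * 5 * 128, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, X):
        X = F.max_pool2d(F.relu(self.conv1(X)), 2, 2)
        X = F.max_pool2d(F.relu(self.conv2(X)), 2, 2)
        X = X.view(-1, 5 * 5 * 128)
        X = F.relu(self.fc1(X))
        X = F.relu(self.fc2(X))
        return self.fc3(X)  # raw logits for nn.CrossEntropyLoss
```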

I used this code exactly, and yet my GPU (GTX 2080 with 50th percentile performance) was slower than my CPU (i7-8700), with the results being: CPU 0.0009041459998115897, GPU 0.0022378258872777224. Any reason why this might be? If I increase the data size a lot, the GPU begins to clearly outperform the CPU, but I thought my GPU was pretty good, so why wouldn’t I see a performance difference at the same data sizes that your machine does?

I don’t know which code you are using or referring to. In case you are using conv layers, use torch.backends.cudnn.benchmark = True, add some warmup iterations, and profile the code again.
You could also profile the code using the PyTorch profiler or e.g. Nsight Systems to check for potential bottlenecks.
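A rough sketch of that kind of measurement with the built-in torch.profiler (available in recent PyTorch versions; the model and input shapes below are placeholders):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest conv algorithms

model = nn.Sequential(nn.Conv2d(1, 64, 3), nn.ReLU(), nn.Conv2d(64, 64, 3)).cuda()
x = torch.randn(128, 1, 28, 28, device='cuda')

# Warmup: the first iterations pay for cuDNN algorithm selection and CUDA init
for _ in range(10):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```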