Why is PyTorch (CUDA) running slow on GPU?

Hey there.

I have been playing around with PyTorch on Linux for some time now and recently decided to try to get more scripts running on the GPU of my Windows desktop. Since trying this I have noticed a massive difference between GPU and CPU execution time on the same scripts: the GPU is significantly slower than the CPU. To illustrate this I used a tutorial program found here (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors)

import torch
import datetime

dtype = torch.double
#device = torch.device("cpu")
device = torch.device("cuda:0")

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

start = datetime.datetime.now()
learning_rate = 1e-6
for t in range(5000):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    #print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
end = datetime.datetime.now()
print(end - start)


I increased the number of epochs from 500 to 5000, as I have read that the first CUDA call is very slow due to initialisation. However, the performance issue still exists.

With device = torch.device("cpu") the final time printed out is around 3-4 seconds, while device = torch.device("cuda:0") executes in around 13-15 seconds.

I have reinstalled PyTorch a number of different ways (uninstalling the previous installation each time, of course) and the problem still persists. I am hoping that someone can help me if I have perhaps missed a step (didn’t install some other API/program) or am doing something wrong in the code.

Python: v3.6


GPU: NVIDIA GeForce GTX 1060 6GB

CUDA: 9.0 (According to torch.version.cuda)

Any help would be appreciated :)

Five observations:

  • for me (an i5-7500 CPU reporting four processors, and a 1080 Ti), 5000 loops on CUDA take 12 seconds, but the CPU takes much longer (500 loops in 23 seconds),
  • double is much slower on the GPU than float. This is why float is the standard type in PyTorch. On (x86) CPUs, it probably doesn’t matter much,
  • loss = (y_pred - y).pow(2).sum().item() will take the result (living on the GPU up until the sum()) and then transfer it to the CPU for .item(). This kind of synchronisation point makes it slow. You can drop this line. For printing, I usually only do this every x iterations, where x is chosen so that something is printed every 5-10 seconds,
  • your network is really small, so one would expect the GPU to not provide as much of an advantage as it would for larger ones,
  • automatic backward is probably faster than the manual one in this case.
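As a rough illustration of how these observations might be combined, here is a minimal sketch of the same two-layer network using float32, autograd instead of the manual backward pass, and a loss read-out only every few iterations. The function name `train` and the parameters `steps` and `log_every` are made up for this example, and the script falls back to the CPU when no CUDA device is available. Actual speed-ups will depend on your hardware.

```python
import torch


def train(device, dtype=torch.float32, steps=500, log_every=100):
    """Train the two-layer network from the question with autograd.

    `steps` and `log_every` are illustrative parameters, not part of
    the original script.
    """
    N, D_in, H, D_out = 64, 1000, 100, 10

    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    # requires_grad=True lets autograd compute the gradients for us.
    w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

    learning_rate = 1e-6
    logged = []
    for t in range(steps):
        y_pred = x.mm(w1).clamp(min=0).mm(w2)
        loss = (y_pred - y).pow(2).sum()  # stays on the device

        # .item() copies the value to the host and forces a sync,
        # so only do it occasionally.
        if t % log_every == 0:
            logged.append(loss.item())

        loss.backward()
        with torch.no_grad():
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
            w1.grad.zero_()
            w2.grad.zero_()
    return logged


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
losses = train(device)
print(losses)
```

Passing `dtype=torch.double` to the same function gives a like-for-like comparison with the original script.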

Best regards



Hey! If anyone is running into this issue: I found that dtype=double is troublesome for the GPU, so convert everything to .float(). For me that made a speed difference of about one order of magnitude.
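To check the float versus double difference on your own machine, something like the following sketch could be used. The helper `time_matmul` and its parameters `size` and `repeats` are hypothetical names chosen for this example. Because CUDA kernels are launched asynchronously, the sketch calls `torch.cuda.synchronize()` before reading the clock, and it falls back to the CPU if no GPU is present.

```python
import time

import torch


def time_matmul(device, dtype, size=512, repeats=20):
    """Time `repeats` square matrix multiplications (illustrative helper)."""
    a = torch.randn(size, size, device=device, dtype=dtype)
    b = torch.randn(size, size, device=device, dtype=dtype)

    a.mm(b)  # warm-up, so one-off initialisation is not measured
    if device.type == "cuda":
        torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(repeats):
        a.mm(b)
    if device.type == "cuda":
        # Wait for queued kernels to finish before stopping the clock.
        torch.cuda.synchronize(device)
    return time.perf_counter() - start


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
timings = {
    "float32": time_matmul(device, torch.float32),
    "float64": time_matmul(device, torch.float64),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s on {device}")
```

The ratio between the two numbers varies by GPU; consumer GeForce cards typically have far lower double-precision throughput than single-precision.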