Why is CUDA not increasing performance?


I’m training an autoencoder using this set of scripts (specifically the attention-based parts) with cuda.
I’ve enabled cuda on every tensor I can find, and also set the usual suspects:

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.fastest = True

It’s definitely enabled and I can see a bit of GPU memory in use, but there is no performance speed-up at all. In fact (probably due to all the additional initialisation) I’m seeing a marginal decrease in performance.

Am I missing something obvious, or are the tensor operations simply not particularly suited to GPU?

Note: functionally the scripts work exactly as expected and produce a working trained model.


Let’s start by understanding your setup.
What base OS do you have?
Is it a local machine, or are you running in the cloud?
What versions of CUDA, cuDNN, PyTorch, and other libraries do you have?
How many CPU cores, and which GPU model do you have?
How many GPUs?
How did you install each piece of software?
Were they pre-compiled, or did you compile them yourself?
If so, which tutorials or scripts did you use?

Let’s start the discussion from there.


Thanks, I should maybe have clarified, I’m already running other scripts successfully using cuda on this instance. For example there is a set of RNNs for which I’m seeing a very clear 15X speed-up when enabling cuda. Although the attention-based autoencoder works a little differently, I was hopeful of seeing at least a measurable increase when enabling cuda.

What I’m trying to understand is why there isn’t any improvement in performance. For example, are the types of tensor operations in the attention-based autoencoder not optimised for GPU? Or am I missing something more obvious?

I don’t want to clutter the thread with my version of the scripts, but basically I perform an initial check with:


This returns True, and moving tensors to cuda afterwards proceeds without error.

Regarding your question, I’ve tried the attention-based scripts (with some modifications for cuda) on two different Cloud platforms. One of them is an AWS Sagemaker instance, info below.

Instance type: ml.p2.xlarge (4 X vCPU, 1 X K80 GPU, 61GiB Memory)
Notebook kernel: conda_pytorch_p27
Kernel: 4.14.72-68.55.amzn1.x86_64 (Amazon modified)
OS: Red Hat 7.2.1-2
Pytorch: 0.4.1
Cuda: 9.2.148
Cudnn: 7104

These are all provided by Sagemaker by default.


Hi again,

Thanks for helping me understand your environment a bit better.

There are some cases where the CPU and GPU give the same performance.
It depends on the architecture of your network, and may depend on the volume of data you are using.

Maybe the network is too shallow.
Maybe if you increase the batch size for GPU training you will see a difference compared to the CPU.
Maybe the code has some latency between each batch sent to the GPU.
Without seeing your code/model and the way your data is fed in during training, it is hard to tell.
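
A rough way to test the batch-size idea is to time one training step on each device at a few batch sizes. The sketch below is only an illustration: the two-layer network is a hypothetical stand-in (not your attention-based autoencoder), the name time_training_step and the sizes are made up, and it assumes Python 3 and a reasonably recent PyTorch. The important detail is the call to torch.cuda.synchronize() before reading the clock, since CUDA work is queued asynchronously.

```python
import time

import torch
import torch.nn as nn


def time_training_step(device, batch_size, n_features=256, repeats=20):
    """Return the mean seconds per forward/backward/update step on `device`."""
    # Hypothetical stand-in network, NOT the attention-based autoencoder.
    model = nn.Sequential(
        nn.Linear(n_features, 512), nn.ReLU(), nn.Linear(512, n_features)
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    x = torch.randn(batch_size, n_features, device=device)

    def step():
        optimizer.zero_grad()
        loss = loss_fn(model(x), x)
        loss.backward()
        optimizer.step()

    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock.
        if device.type == "cuda":
            torch.cuda.synchronize()

    step()  # warm-up: the first call pays one-off initialisation and allocation costs
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        step()
    sync()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices.append(torch.device("cuda"))
    for batch_size in (8, 64, 512, 4096):
        for device in devices:
            seconds = time_training_step(device, batch_size)
            print(f"{device.type:4s} batch={batch_size:5d} {seconds * 1e3:8.3f} ms/step")
```

If the GPU only pulls ahead at the larger batch sizes, the model is probably too small to keep it busy at the batch size you are using now.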

Note: a GPU accelerates matrix computations, but if your computations are not very large or expensive, the CPU can achieve almost the same performance. (A modern GPU runs 1024+ threads, while the top CPU today has only about 200 threads.)
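
To make that note concrete, here is a toy cost model: each step costs a fixed overhead (kernel launches, host-to-device transfers) plus the work divided by throughput. All numbers below are made up purely to show the shape of the trade-off; they are not measurements of a K80 or of any real CPU, and the names step_time and speedup are hypothetical.

```python
def step_time(n_ops, fixed_overhead_s, ops_per_second):
    """Toy model: time = fixed per-step overhead + work / throughput."""
    return fixed_overhead_s + n_ops / ops_per_second


# Hypothetical, made-up numbers: the GPU has 100x the throughput
# but also 100x the fixed overhead per step.
CPU = {"fixed_overhead_s": 1e-6, "ops_per_second": 1e10}
GPU = {"fixed_overhead_s": 1e-4, "ops_per_second": 1e12}


def speedup(n_ops):
    """Ratio of modelled CPU time to modelled GPU time (>1 means the GPU wins)."""
    return step_time(n_ops, **CPU) / step_time(n_ops, **GPU)


if __name__ == "__main__":
    for n_ops in (1e4, 1e6, 1e8, 1e10):
        print(f"{n_ops:.0e} ops per step: GPU speed-up {speedup(n_ops):.2f}x")
```

In this model the GPU is slower for tiny steps, roughly level in the middle, and only approaches its full advantage once the work per step is large enough to hide the overhead.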

Your instance has 4 vCPUs (on AWS a vCPU is a hardware thread, not a full core), so if that keeps pace with the GPU, to me it means your data is so easy to process that the CPU has no trouble with it at all.


Thanks for the explanation, I see what you’re getting at.
As per your suggestion I’m going to play around a bit more with the input and hyperparameters. I’ll post back with any new findings.