Performance with CUDA against CPU

Hi, I’m trying to understand the CUDA implementation and how to increase the performance of a neural network, but I’m facing the following issue and would appreciate any guidance on the topic. I’m performing a very simple forward pass on a random tensor (code attached). However, I’m getting better timing on the CPU than on the GPU (a result I didn’t expect).

import random
from time import time

import torch

class SmallModel(torch.nn.Module):
    def __init__(self, in_f) -> None:
        super().__init__()
        self.cnn = torch.nn.Sequential(
            torch.nn.Linear(in_f, 200),
            torch.nn.ReLU(),
            torch.nn.Linear(200,200),
            torch.nn.ReLU(),
            torch.nn.Linear(200,200),
            torch.nn.ReLU(),
            torch.nn.Linear(200,100)
        )
    
    def forward(self, x:torch.Tensor)->torch.Tensor:
        print(x.device)
        return self.cnn(x)

device = 'cuda'

a = torch.randn((100,100))
a = a.to(device)
m = SmallModel(100)
m = m.to(device)
start_time = time()
b = m(a)

print(f"Total time: {(time() - start_time)*1000} ms with device {device}")

The timings for this evaluation are:
with CUDA: 398 ms
with CPU: 1.50 ms

The specifications of my computer are:
GPU: NVIDIA GeForce 1660 Ti 6 GB with CUDA 11.7
CPU: AMD Ryzen 7 2700X (8 cores)
Memory: 24 GB
PyTorch version: 1.13.0+cu117

I would sincerely appreciate any hints on this topic.

Several points:

  1. You need to do at least some warmup, because the first execution is never optimal on either the CPU or the GPU:

    So add something like:

    for k in range(20):
        b = m(a)

    before you take the measurement.

  2. The network is really too tiny to see a significant benefit from the GPU.

  3. Measure over several iterations of execution to get reliable values.

  4. Since PyTorch GPU execution is asynchronous, make sure you have a synchronisation point at the end (such as copying some result back to the CPU with b[0,0].item()); see the sketch after this list.
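A minimal benchmarking sketch putting these points together (assuming the SmallModel class from the question above is already defined, and using torch.cuda.synchronize() to wait for queued GPU work before reading the clock):

    import torch
    from time import time

    device = 'cuda'
    m = SmallModel(100).to(device)
    a = torch.randn((100, 100), device=device)

    # Warmup: the first calls pay one-off costs (CUDA context creation,
    # kernel selection, allocator growth), so exclude them from the timing.
    for _ in range(20):
        b = m(a)
    torch.cuda.synchronize()

    # Time many iterations and report the average. GPU kernel launches return
    # immediately, so synchronize before stopping the clock.
    n_iters = 100
    start_time = time()
    for _ in range(n_iters):
        b = m(a)
    torch.cuda.synchronize()
    print(f"Average time: {(time() - start_time) / n_iters * 1000:.3f} ms with device {device}")

With warmup and averaging in place, the GPU number reflects steady-state execution rather than one-off startup cost.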


Hi, thanks for the hints, and sorry about the delayed reply. I changed some parameters and found that running around 100 loops or so does show slightly better performance on the GPU.

Since I was doing these tests in order to understand the best way to deploy an ANN for some real-time applications, I would like to know the best way to implement it without the large overhead of transferring the data to the GPU. I would really appreciate any suggestions or literature on this topic.

Thanks!

It all depends on the network size; for some tiny networks it indeed does not make a huge difference. However, take into account that the data is usually relatively small: you keep the network on the GPU and only transfer the inputs and outputs to and from the CPU.

Once again, your network is really tiny and does not utilize the power of the GPU, so it depends on the specific setup. If you start with conv-nets or something bigger, the GPU is much more powerful.
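For a real-time setup along the lines you describe, a rough sketch of the usual pattern is below: the model stays resident on the GPU, and only the (small) input and output tensors cross the CPU/GPU boundary each step. Here get_next_input is a hypothetical placeholder for whatever produces your real-time data, and the SmallModel class from the question is assumed to be defined.

    import torch

    device = 'cuda'
    model = SmallModel(100).to(device).eval()  # weights live on the GPU once

    def get_next_input() -> torch.Tensor:
        # hypothetical stand-in for the real-time data source on the CPU
        return torch.randn((1, 100))

    with torch.no_grad():          # inference only, skip autograd bookkeeping
        for _ in range(10):        # stand-in for the real-time loop
            x = get_next_input().to(device)   # small host-to-device copy
            y = model(x)                      # forward pass runs on the GPU
            y_cpu = y.cpu()                   # small device-to-host copy
            # ... consume y_cpu ...

Whether this beats a CPU-only path still depends on the network size, as noted above; for a model this small the transfer overhead can dominate.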