Model is faster on CPU than GPU

I made a simple model to see whether using my GPU would improve speed, but it did not. I want to understand whether my GPU is working correctly or whether I need to change settings to get more benefit from the GPU.

My GPU is recognized, but in the Task Manager it never seems to be busy.

Would my GPU benefit more if I changed settings (like a larger batch size, a larger model, or more data)?

  • My data shape is: (100000, 121)
  • batch size: 32
  • epochs: 10
  • model:
PremiumPredictor(
   (fc1): Linear(in_features=121, out_features=64, bias=True)
   (fc2): Linear(in_features=64, out_features=1, bias=True)
   (relu): ReLU()
)

Training takes 37s on the GPU and 19s on the CPU.

You would need to select the compute view in Windows’ Task Manager, as the default view won’t show the utilization of compute applications.

Yes, increasing the workload should increase the utilization as your current use case might be CPU-limited. If you want to use this small model, you might want to use CUDA Graphs as it will reduce the CPU workload.
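
For reference, a minimal sketch of capturing a forward pass with CUDA Graphs (the layer sizes just mirror your model above; this is illustrative, not a drop-in replacement for your training loop):

import torch
import torch.nn as nn

# Small MLP with the same shape as the model above (illustrative only)
model = nn.Sequential(nn.Linear(121, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
static_input = torch.randn(32, 121, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input tensor and launch the whole
# graph with a single CPU call instead of one kernel launch per op
static_input.copy_(torch.randn(32, 121, device="cuda"))
g.replay()  # static_output now holds the new result

For a full training step, torch.cuda.make_graphed_callables can graph a module’s forward and backward; the sketch above only captures inference.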

Thanks @ptrblck. I tried the compute view in the Task Manager, but that doesn’t show anything.

I ran a different code snippet this time (see below), which maybe doesn’t show up in the compute view? I only see a spike in one or two of the Copy views when I close/restart the Python kernel.

nvidia-smi in cmd gives me the most hope that my GPU is doing something: the temperature increases from 45 to 76 °C and the Volatile GPU-Util goes from 0% to 100%.

import torch
import time

device = "cuda"

x = torch.randn(10000, 10000)

## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")

for _ in range(10):
    ## GPU version
    x = x.to(device)
    _ = torch.matmul(x, x)  # First operation to 'burn in' GPU
    # CUDA is asynchronous, so we need to use different timing functions
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    _ = torch.matmul(x, x)
    end.record()
    torch.cuda.synchronize()  # Waits for everything to finish running on the GPU
    print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds
# CPU time: 20.77174s
# GPU time: 0.59403s

Great, so please stick to nvidia-smi, as we have seen a lot of confusion when Windows’ Task Manager is used.

Wanted to add for other users:

In the Task Manager there is a CUDA view, which in my case shows work being done, while the compute view shows nothing.

CPU cores are much faster than GPU cores, but there are only a few of them per CPU.

For example, an Intel i9 has 24 cores with up to 6 GHz speed.

An Nvidia 4090 has 512 Tensor Cores, which have only a 2.2 GHz base clock.

So unless you’re giving the GPU enough concurrent operations to keep at least roughly 72 of those cores busy (24 cores × 6 GHz ≈ 144 GHz of aggregate clock, which at about 2 GHz per GPU core works out to roughly 72 cores), the CPU will be faster.

Hence, a larger model or a bigger batch size might make more efficient use of your GPU.
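
As a rough illustration (a sketch, not your training script; the model and the batch sizes are placeholders), you can time a single training step at a few batch sizes on CPU and GPU and look for the crossover:

import time
import torch
import torch.nn as nn

def time_step(device, batch_size, n_iters=50):
    # Small MLP similar in shape to the model above
    model = nn.Sequential(nn.Linear(121, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    x = torch.randn(batch_size, 121, device=device)
    y = torch.randn(batch_size, 1, device=device)
    loss_fn(model(x), y).backward()  # warm-up step before timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.time() - start) / n_iters

for bs in (32, 512, 8192):
    cpu_t = time_step("cpu", bs)
    gpu_t = time_step("cuda", bs) if torch.cuda.is_available() else float("nan")
    print(f"batch {bs:5d}: CPU {cpu_t * 1e3:7.2f} ms/step, GPU {gpu_t * 1e3:7.2f} ms/step")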

Thanks for your reply. And indeed, I see that with a larger model and/or a larger batch size, the GPU is faster.

Besides the speed of blasting through samples, there is also model performance, and from what I read online, a smaller batch size is considered to generalize better, right? So that’s one reason not to go overboard with an extremely large batch size, I guess.

It’s quite the opposite: a larger batch size is better for generalization. Usually memory constraints are the reason for smaller batches, but accumulating gradients across several batches before the optimizer step can mimic a larger batch size.
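
A minimal sketch of that accumulation trick, assuming a generic model, optimizer, and DataLoader (all names and sizes here are illustrative):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(121, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(1024, 121), torch.randn(1024, 1)), batch_size=32)

accum_steps = 8  # effective batch size = 8 * 32 = 256

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()                                          # gradients add up in .grad across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one parameter update per accum_steps mini-batches
        optimizer.zero_grad()

Note that this mimics the optimizer behaviour of a large batch, but not its GPU utilization, since each forward/backward pass still runs on the small per-step batch.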

I read online that a large batch size generally leads to sharp minima, while a small batch size leads to flat minima.

E.g. this post on Medium, which discusses an (old) paper: Why Small Batch sizes lead to greater generalization in Deep Learning | by Devansh | Geek Culture | Medium

Interesting paper. See page 7:

" It is often reported that when increasing the batch size for a problem, there exists a threshold after which there is a deterioration in the quality of the model. This behavior can be observed for the F2 and C1 networks in Figure 4. In both of these experiments, there is a batch size (≈ 15000 for F2 and ≈ 500 for C1) after which there is a large drop in testing accuracy."

In other words, test accuracy holds up as the batch size increases, up to around 15,000 and 500, respectively, for those models and datasets, and then drops sharply. If your batch size is 32, you’re probably still well below that threshold.

I do not understand one thing.

Recently, while working on my project, I trained a neural network with approximately 10K parameters. Initially, I used a batch size of 128, and the network converged quickly. To better utilize my GPU, I increased the batch size to 2048, but to my surprise, the convergence rate slowed down compared to 128.

Maybe I need to reduce my learning rate if I increase my batch size, or vice versa for small batch sizes?

One thing I learned is that larger batch sizes often lead to better generalization, while smaller batch sizes tend to result in faster convergence. How do I know which one to follow?

Can somebody explain this to me, please?

Thank you!

The paper shared earlier by @Eardrum7 suggested that a larger batch size (as in 10% or more of the total training data) showed a consistent drop in performance.

However, I don’t know of any definitive studies that show how this behavior scales based on total dataset size. That would be interesting to see.

With a larger batch size, you have fewer update steps per epoch, so with the same learning rate you converge more slowly. If you want to use a larger batch size, you can increase the learning rate and keep the same convergence speed. See this paper.

They show that you can keep the same convergence speed but, thanks to the larger batch size, with much faster computation, resulting in less total wall-clock time. You do have to be careful with the first updates; see Figure 2.
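
A rough sketch of that recipe, scaling the learning rate with the batch size and warming it up over the first few epochs (the constants and the LinearLR schedule are illustrative choices, not taken from the linked paper):

import torch
import torch.nn as nn

base_lr = 0.01           # learning rate tuned for the reference batch size
base_batch_size = 128
new_batch_size = 2048
scaled_lr = base_lr * new_batch_size / base_batch_size  # linear scaling of the learning rate

model = nn.Sequential(nn.Linear(121, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)

# Ramp the learning rate from roughly the base value up to the scaled value
# over the first warmup_epochs epochs to protect the early updates
warmup_epochs = 5
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=base_lr / scaled_lr, total_iters=warmup_epochs
)

for epoch in range(10):
    # ... one epoch of training with the large batch size would go here ...
    warmup.step()
    print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.4f}")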