Conv2d computations 10x faster with high GPU memory usage than with low GPU memory usage

Hi everyone,

I’m currently experiencing weird behavior when using the nn.Conv2d module.
When I run the following code on a more or less empty GPU, the computation is much slower (~10x) than when GPU memory is almost full.

import time
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with torch.no_grad():
    filter = nn.Conv2d(
        in_channels=1,
        out_channels=1,
        kernel_size=3,
        padding=1,
        bias=False, 
        device=device)

total_start = time.time()
for i in range(400):
    a = torch.rand((1, 1, 128, 128), device=device)
    b = torch.rand((4096, 1, 128, 128), device=device)
    with torch.no_grad():
        c = filter(a)
        d = filter(b)
print(f'Total execution time {time.time()-total_start:.6f} seconds')

The execution of the code can be accelerated by ~10x by first blocking GPU memory, which I do by running the following code in a different Python terminal. The batch size has to be adjusted to get a joint memory consumption of more than 90% of GPU memory. For an NVIDIA GeForce RTX 3090 with display drivers running, 260000 seems to be a good fit.

import torch
a = torch.rand((260000, 1, 128, 128), device='cuda')

I tested the behavior at various levels of GPU memory utilization and found that once the combined utilization surpasses ~90% of GPU memory, the execution speed increases tremendously. I have also tested the behavior on two different systems, both equipped with an NVIDIA GeForce RTX 3090, and both showed the same behavior.
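
(For reference, this is roughly how I check the combined utilization from Python; I’m assuming torch.cuda.mem_get_info is available in this torch version, otherwise I just read the numbers off nvidia-smi.)

import torch

# Free vs. total device memory as reported by the CUDA driver,
# similar to the numbers nvidia-smi shows.
free, total = torch.cuda.mem_get_info()
print(f'GPU memory in use: {(total - free) / 1024**2:.0f} MiB / {total / 1024**2:.0f} MiB '
      f'({1 - free / total:.1%})')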

With 8656MiB / 24576MiB memory usage I get the following output:

Total execution time 20.186182 seconds

With 22908MiB / 24576MiB memory usage I get the following output:

Total execution time 2.597108 seconds

Has anybody experienced this before, and can you point me to how I can get the fast execution speed at low GPU memory utilization?

My torch version is 1.11.0
My cudatoolkit version is 11.3.1

Thank you in advance!

CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers.
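
For example, CUDA events can be used instead of host-side timers; a minimal sketch (not your exact code) would look like this:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... run the convolutions to be timed here ...
end.record()

torch.cuda.synchronize()  # wait until the recorded events have completed
print(f'Elapsed GPU time: {start.elapsed_time(end):.3f} ms')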

Thank you very much for the quick reply!
I modified the code to synchronize the CUDA operations as follows; however, I still get similar results.

import time
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with torch.no_grad():
    filter = nn.Conv2d(
        in_channels=1,
        out_channels=1,
        kernel_size=3,
        padding=1,
        bias=False, 
        device=device)

torch.cuda.synchronize()
total_start = time.time()
for i in range(400):
    a = torch.rand((1, 1, 128, 128), device=device)
    b = torch.rand((4096, 1, 128, 128), device=device)
    with torch.no_grad():
        c = filter(a)
        d = filter(b)
torch.cuda.synchronize()
print(f'Total execution time {time.time()-total_start:.6f} seconds')

This time the computation slowed down a bit, but it is still about ten times slower with the mostly empty GPU memory.
With 7185MiB / 24576MiB memory usage:

Total execution time 25.066509 seconds

With 22920MiB / 24576MiB memory usage:

Total execution time 2.779978 seconds

I suppose that something might be going wrong with the memory allocation, leading to very slow computation while there is a lot of free space to allocate and speeding up once the free space gets reduced.
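
In case it helps, here is a quick sketch of what I’m printing to inspect the caching allocator around the loop (just torch.cuda’s built-in statistics, nothing conclusive yet):

import torch

# Memory currently occupied by live tensors vs. memory the caching
# allocator has reserved from the driver (cached for reuse).
print(f'allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB')
print(f'reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB')

# Detailed breakdown of the caching allocator's state.
print(torch.cuda.memory_summary(abbreviated=True))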

What confuses me, however, is that with a slightly lower memory usage of 21249MiB / 24576MiB (which I get by running a = torch.rand((200000, 1, 128, 128), device='cuda') in a different Python terminal) the computation time is still very long.

Total execution time 25.214885 seconds

This means that the increase in speed is not continuous but happens suddenly once a certain threshold of memory usage is surpassed.
I will try to test this on a different graphics card with a different CUDA version to see whether this is a problem related to the RTX 3090 or to the cudatoolkit version; in the meantime, I’m still very thankful for any ideas on this topic.
Thanks a lot!

After updating to torch 1.12.0 the problem doesn’t seem to appear anymore. Thanks for the help!