Conv2d computations 10x faster with high GPU memory usage than with low GPU memory usage

Hi everyone,

I’m currently experiencing weird behavior when using the nn.Conv2d module.
When I run the following code on a more or less empty GPU, the computation is much slower (~10x) than when GPU memory is almost full.

import time
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with torch.no_grad():
    filter = nn.Conv2d(
        in_channels=1,
        out_channels=1,
        kernel_size=3,
        padding=1,
        bias=False, 
        device=device)

total_start = time.time()
for i in range(400):
    a = torch.rand((1, 1, 128, 128), device=device)
    b = torch.rand((4096, 1, 128, 128), device=device)
    with torch.no_grad():
        c = filter(a)
        d = filter(b)
print(f'Total execution time {time.time()-total_start:.6f} seconds')

The execution of the code can be accelerated by ~10x by first blocking GPU memory, which I do by running the following code in a different Python terminal. The batch size has to be adjusted to get a joint memory consumption of more than 90% of GPU memory. For an NVIDIA GeForce RTX 3090 with display drivers running, 260000 seems to be a good fit.

import torch
a = torch.rand((260000, 1, 128, 128), device='cuda')

I tested the behavior at various levels of GPU memory utilization and found that once the combined utilization surpasses ~90% of GPU memory, the execution speed increases tremendously. I have also tested the behavior on two different systems, both equipped with an NVIDIA GeForce RTX 3090, and both showed the same behavior.
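
(For reference, this is roughly how I check the combined utilization from Python; I’m assuming torch.cuda.mem_get_info is available in this torch version, otherwise I just read the numbers off nvidia-smi.)

import torch

# Free vs. total device memory as reported by the CUDA driver,
# similar to the numbers nvidia-smi shows.
free, total = torch.cuda.mem_get_info()
print(f'GPU memory in use: {(total - free) / 1024**2:.0f} MiB / {total / 1024**2:.0f} MiB '
      f'({1 - free / total:.1%})')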

With 8656MiB / 24576MiB memory usage I get the following output:

Total execution time 20.186182 seconds

With 22908MiB / 24576MiB memory usage I get the following output:

Total execution time 2.597108 seconds

Has anybody experienced this before, and can you point me to how I can get the fast execution speed at low GPU memory utilization?

My torch version is 1.11.0
My cudatoolkit version is 11.3.1

Thank you in advance!

CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers.
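
For example, CUDA events can be used instead of host-side timers; a minimal sketch (not your exact code) would look like this:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... run the convolutions to be timed here ...
end.record()

torch.cuda.synchronize()  # wait until the recorded events have completed
print(f'Elapsed GPU time: {start.elapsed_time(end):.3f} ms')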

Thank you very much for the quick reply!
I modified the code to synchronize the CUDA operations as follows; however, I still get similar results.

import time
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with torch.no_grad():
    filter = nn.Conv2d(
        in_channels=1,
        out_channels=1,
        kernel_size=3,
        padding=1,
        bias=False, 
        device=device)

torch.cuda.synchronize()
total_start = time.time()
for i in range(400):
    a = torch.rand((1, 1, 128, 128), device=device)
    b = torch.rand((4096, 1, 128, 128), device=device)
    with torch.no_grad():
        c = filter(a)
        d = filter(b)
torch.cuda.synchronize()
print(f'Total execution time {time.time()-total_start:.6f} seconds')

This time the computation slowed down a bit, but it is still about ten times slower with the mostly empty GPU memory.
With 7185MiB / 24576MiB memory usage:

Total execution time 25.066509 seconds

With 22920MiB / 24576MiB memory usage:

Total execution time 2.779978 seconds

I suppose that something might be going wrong with the memory allocation, leading to very slow computation while there is a lot of free space to allocate and speeding up once the free space gets reduced.
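
In case it helps, here is a quick sketch of what I’m printing to inspect the caching allocator around the loop (just torch.cuda’s built-in statistics, nothing conclusive yet):

import torch

# Memory currently occupied by live tensors vs. memory the caching
# allocator has reserved from the driver (cached for reuse).
print(f'allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB')
print(f'reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB')

# Detailed breakdown of the caching allocator's state.
print(torch.cuda.memory_summary(abbreviated=True))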

What confuses me, however, is that with a slightly lower memory usage of 21249MiB / 24576MiB (which I get by running a = torch.rand((200000, 1, 128, 128), device='cuda') in a different Python terminal) the computation time is still very long.

Total execution time 25.214885 seconds

This means that the increase in speed is not continuous but happens suddenly once a certain threshold of memory usage is surpassed.
I will try to test this on a different graphics card with a different CUDA version to see whether this is a problem related to the RTX 3090 or to the cudatoolkit version; in the meantime, I’m still very thankful for any ideas on this topic.
Thanks a lot!

After updating to torch 1.12.0 the problem doesn’t seem to appear anymore. Thanks for the help!