Slow convolution on GPU

Hi all,

I want to do a lot of convolutions on GPU (but not in the context of deep learning; there is no dataloader and no model). I have 100 images of size 1000×1000 and one 256×256 kernel. But I can't figure out why it is so slow (basically the same computation time as on CPU). So I tried to display the computation time / the real elapsed time, but I am a bit lost, as it seems there are some things to take care of, like using torch.cuda.synchronize(), torch.no_grad(), and torch.cuda.empty_cache().
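
From what I have read, the safe pattern seems to be a warm-up call plus explicit synchronization around the timed region, something like this (just my understanding, and the timed helper is a name I made up):

import time
import torch

def timed(fn, warmup=2, reps=5):
    # warm-up: the first calls can include CUDA context init and algorithm selection
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()  # make sure no GPU work is still pending before starting the clock
    start = time.time()
    for _ in range(reps):
        fn()
    torch.cuda.synchronize()  # wait for the GPU to actually finish before stopping the clock
    return (time.time() - start) / reps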

So I have written some code below, and I am not quite sure what to do, but what I know is that currently one convolution on GPU takes far too much time (~1 min), and I feel like it is not the computation itself but the GPU/CPU transfer.
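
To check that, here is how I would try to time the transfers and the convolution separately (my own attempt; it reuses device, imgs, kernel and functional from the snippet below):

torch.cuda.synchronize()
t0 = time.time()
imgs_gpu = imgs.to(device)            # CPU -> GPU copy
kernel_gpu = kernel.to(device)
torch.cuda.synchronize()
t1 = time.time()
conv_gpu = functional.conv2d(imgs_gpu, kernel_gpu, padding="valid")
torch.cuda.synchronize()              # the conv launch is asynchronous, so wait for it to finish
t2 = time.time()
result = conv_gpu.cpu().numpy()       # GPU -> CPU copy of the result
t3 = time.time()
print("to GPU: {:.2f} s | conv: {:.2f} s | to CPU: {:.2f} s".format(t1 - t0, t2 - t1, t3 - t2))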

Does someone have a hint on this? Is it normal? Is it because of my CUDA installation? Am I missing something?

Thank you!

import gc
import time

import torch
import torch.nn.functional as functional

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
imgs = torch.rand((100, 1, 1024, 1024))
kernel = torch.rand((1, 1, 256, 256))

def my_function():
    imgs_gpu = imgs.to(device)                     # CPU -> GPU transfer
    kernel_gpu = kernel.to(device)
    conv_gpu = functional.conv2d(imgs_gpu, kernel_gpu, padding="valid")
    imgs_cpu = imgs.squeeze().cpu().numpy()        # note: converts the original CPU tensor, not conv_gpu

s = time.time()
# computation without synchronization
for i in range(5):
    start = time.time()
    my_function()
    end = time.time()
    print("{:.2f} s".format(end - start))
    # computation with synchronize
    torch.cuda.synchronize()
    start = time.time()
    my_function()
    torch.cuda.synchronize()
    end = time.time()
    print("{:.2f} s".format(end - start))
    # torch.no_grad
    with torch.no_grad():
        start = time.time()
        my_function()
        end = time.time()
        print("{:.2f} s".format(end - start))
    # empty cache
    with torch.no_grad():
        torch.cuda.empty_cache()
        start = time.time()
        my_function()
        end = time.time()
        print("{:.2f} s".format(end - start))
    print("=> {:.2f} s".format(time.time() - s))
    print("==============")
    torch.cuda.empty_cache()
    gc.collect()

Results:

0.09 s
56.81 s
0.14 s
0.11 s
=> 170.90 s
==============
0.13 s
56.78 s
0.11 s
0.14 s
=> 398.07 s
==============
0.11 s
56.83 s
0.08 s
0.14 s
=> 625.42 s
==============
0.11 s
57.49 s
0.11 s
0.11 s
=> 854.32 s
==============
0.13 s
56.77 s
0.11 s
0.13 s
=> 1081.45 s
==============

Could you post a minimal, executable code snippet as well as the output of python -m torch.utils.collect_env, please?

Minimal example:


import numpy as np
import time
import torch
import torch.nn.functional as functional

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
imgs = torch.rand((100, 1, 1024, 1024))
kernel = torch.rand((1, 1, 256, 256))

def my_function():
    imgs_gpu = imgs.to(device)
    kernel_gpu = kernel.to(device)
    conv_gpu = functional.conv2d(imgs_gpu, kernel_gpu, padding="valid")
    imgs_cpu = imgs.squeeze().cpu().numpy()  # note: converts the original CPU tensor, not conv_gpu

torch.cuda.synchronize()
start = time.time()
my_function()
torch.cuda.synchronize()
end = time.time()
print("{:.2f} s".format(end-start))

python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Professionnel N
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: True
CUDA runtime version: 11.1.105

GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.11.0+cu113
[pip3] torchvision==0.12.0+cu113
[conda] blas                      1.0                         mkl
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py38hb782905_0
[conda] mkl_fft                   1.2.0            py38h45dec08_0
[conda] mkl_random                1.1.1            py38h47e9c7a_0
[conda] numpy                     1.19.2           py38hadc3359_0
[conda] numpy-base                1.19.2           py38ha3acd2a_0
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1
[conda] torch                     1.11.0+cu113             pypi_0    pypi
[conda] torchvision               0.12.0+cu113             pypi_0    pypi

@ptrblck Is that enough for you, or can I provide more information?

I tried on another machine and it seems to show the same behaviour… Tell me if I am wrong, but when training a CNN, one convolution of a 100×1×1000×1000 batch with a 1×1×256×256 kernel does not take 1 minute on a GPU, does it?

Julien,

A kernel of size 256×256 is very large: with your 1024×1024 inputs the output is 769×769, so a direct convolution is on the order of 100 * 769^2 * 256^2 ≈ 3.9e12 multiply-accumulates. Running your code on my machine, it takes 34.26 s to complete with a 256×256 kernel and 0.32 s with a 5×5 one. CNNs do not typically use kernels of that size, in my experience.
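
If you want to see the scaling on your machine, something like this should show it (a rough sketch; exact numbers depend on the GPU):

import time
import torch
import torch.nn.functional as functional

device = torch.device("cuda")
imgs = torch.rand((100, 1, 1024, 1024), device=device)  # keep everything on the GPU to time only the conv

for k in (5, 64, 256):
    kernel = torch.rand((1, 1, k, k), device=device)
    functional.conv2d(imgs, kernel, padding="valid")  # warm-up run
    torch.cuda.synchronize()
    start = time.time()
    functional.conv2d(imgs, kernel, padding="valid")
    torch.cuda.synchronize()
    print("{0}x{0} kernel: {1:.2f} s".format(k, time.time() - start))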

And you’re using a GPU? Even for a 5×5 kernel, 0.32 s feels a bit slow to me…