Hi all,
I want to run many convolutions on the GPU, but not in a deep-learning context: there is no dataloader and no model. I have 100 images of size 1000*1000 and a single 256*256 kernel. I can't figure out why it is so slow (basically the same computation time as on the CPU). So I tried to measure the computation time / real elapsed time, but I am a bit lost, as there seem to be several things to take care of, such as torch.cuda.synchronize(), torch.no_grad(), and torch.cuda.empty_cache().
So I wrote the code below. I am not quite sure what to conclude, but what I do know is that a single convolution on the GPU currently takes far too long (~1 min), and I suspect the bottleneck is not the computation itself but the GPU/CPU transfer.
Does someone have a hint on this? Is it normal? Is it because of my CUDA installation? Am I missing something?
Thank you !
def my_function():
    imgs_gpu = imgs.to(device)
    kernel_gpu = kernel.to(device)
    conv_gpu = functional.conv2d(imgs_gpu, kernel_gpu, padding="valid")
    # copy the result back to the CPU (conv_gpu, not the input imgs)
    result_cpu = conv_gpu.squeeze().cpu().numpy()
s = time.time()
for i in range(5):
    # computation
    start = time.time()
    my_function()
    end = time.time()
    print("{:.2f} s".format(end - start))

    # computation with synchronize
    torch.cuda.synchronize()
    start = time.time()
    my_function()
    torch.cuda.synchronize()
    end = time.time()
    print("{:.2f} s".format(end - start))

    # torch.no_grad
    with torch.no_grad():
        start = time.time()
        my_function()
        end = time.time()
        print("{:.2f} s".format(end - start))

    # empty cache
    with torch.no_grad():
        torch.cuda.empty_cache()
        start = time.time()
        my_function()
        end = time.time()
        print("{:.2f} s".format(end - start))

    print("=> {:.2f} s".format(time.time() - s))
    print("==============")
    torch.cuda.empty_cache()
    gc.collect()
Results:
0.09 s
56.81 s
0.14 s
0.11 s
=> 170.90 s
==============
0.13 s
56.78 s
0.11 s
0.14 s
=> 398.07 s
==============
0.11 s
56.83 s
0.08 s
0.14 s
=> 625.42 s
==============
0.11 s
57.49 s
0.11 s
0.11 s
=> 854.32 s
==============
0.13 s
56.77 s
0.11 s
0.13 s
=> 1081.45 s
==============
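Since CUDA kernel launches are asynchronous, a wall-clock timing is only meaningful if the GPU is synchronized before and after the timed region; the four variants above can all be timed the same way with a small helper. A sketch (cuda_timed is a made-up name, not a PyTorch API; torch.cuda.Event is an alternative approach):

```python
import time
import torch

def cuda_timed(fn):
    # Synchronize so the timed window covers all queued GPU work,
    # not just the (asynchronous) kernel launches.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start
```

Usage would then be: result, elapsed = cuda_timed(my_function), which also works unchanged on a CPU-only machine.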
Could you post a minimal, executable code snippet as well as the output of python -m torch.utils.collect_env, please?
Minimal example:
import numpy as np
import time
import torch
import torch.nn.functional as functional
device = torch.device('cuda') if torch.cuda.is_available() else torch.device("cpu")
imgs = torch.rand((100, 1, 1024, 1024))
kernel = torch.rand((1, 1, 256, 256))
def my_function():
    imgs_gpu = imgs.to(device)
    kernel_gpu = kernel.to(device)
    conv_gpu = functional.conv2d(imgs_gpu, kernel_gpu, padding="valid")
    # copy the result back to the CPU (conv_gpu, not the input imgs)
    result_cpu = conv_gpu.squeeze().cpu().numpy()
torch.cuda.synchronize()
start = time.time()
my_function()
torch.cuda.synchronize()
end = time.time()
print("{:.2f} s".format(end-start))
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Professionnel N
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.11.0+cu113
[pip3] torchvision==0.12.0+cu113
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38hb782905_0
[conda] mkl_fft 1.2.0 py38h45dec08_0
[conda] mkl_random 1.1.1 py38h47e9c7a_0
[conda] numpy 1.19.2 py38hadc3359_0
[conda] numpy-base 1.19.2 py38ha3acd2a_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] torch 1.11.0+cu113 pypi_0 pypi
[conda] torchvision 0.12.0+cu113 pypi_0 pypi
@ptrblck Is that enough, or can I provide more information?
I tried on another machine and it seems to show the same behaviour… Tell me if I am wrong, but when training a CNN, one convolution of a 100*1*1000*1000 batch with a 1*1*256*256 kernel does not take 1 minute on a GPU, does it?
Julien,
A kernel of size 256x256 is very large. Running your code on my machine, it takes 34.26 s to complete with a 256x256 kernel and 0.32 s with a 5x5 one. In my experience, CNNs do not typically use kernels of that size.
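To put "very large" in numbers, a rough count of the multiply-accumulate operations for the direct valid convolution in the snippet above (100 images of 1024x1024, one 256x256 kernel) can be sketched as:

```python
# Direct "valid" convolution cost: one multiply-accumulate (MAC)
# per kernel element, per output pixel, per image.
n_imgs, H, W, k = 100, 1024, 1024, 256
out_h, out_w = H - k + 1, W - k + 1      # 769 x 769 output pixels
macs = n_imgs * out_h * out_w * k * k
print(f"{macs:.2e} MACs")                # ~3.9e12 multiply-accumulates
```

Several teraflops of work is nontrivial even for a GPU, while a 5x5 kernel costs roughly (256/5)^2, i.e. about 2600x, less per output pixel.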
And you're using a GPU? Even for a 5*5 kernel, 0.32 s feels a bit slow to me…
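For what it's worth, with a kernel this large an FFT-based convolution is usually far cheaper than the direct one (the cost grows like H*W*log(H*W) per image instead of H*W*k*k). A sketch using torch.fft (fft_conv2d_valid is a made-up helper name, not a PyTorch API):

```python
import torch

def fft_conv2d_valid(imgs, kernel):
    # imgs: (B, 1, H, W), kernel: (1, 1, kH, kW)
    B, _, H, W = imgs.shape
    kH, kW = kernel.shape[-2:]
    # pad both to the full linear-convolution size to avoid wrap-around
    fh, fw = H + kH - 1, W + kW - 1
    f_imgs = torch.fft.rfft2(imgs, s=(fh, fw))
    # flip the kernel so the result matches conv2d's cross-correlation
    f_kern = torch.fft.rfft2(torch.flip(kernel, (-2, -1)), s=(fh, fw))
    full = torch.fft.irfft2(f_imgs * f_kern, s=(fh, fw))
    # keep only the "valid" region: (H - kH + 1) x (W - kW + 1)
    return full[..., kH - 1:H, kW - 1:W]
```

It can be validated on small inputs by comparing against functional.conv2d(imgs, kernel, padding="valid") with a small tolerance for float32 round-off.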