I am getting worse profiling results with cudnn.benchmark = True on my toy example (see below), and I am wondering whether this is user error or a bug / an incompatible build. My understanding is that, especially in combination with cudnn.benchmark_limit = 0, cuDNN should brute-force through all available algorithm choices and pick the fastest one, correct?
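For reference, these are the semantics I am assuming (attribute meanings paraphrased from the PyTorch docs):

from torch.backends import cudnn

cudnn.benchmark = True       # autotune: time candidate algorithms on first use of each input shape
cudnn.benchmark_limit = 0    # per the docs, 0 means "try every available algorithm"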
Other slowdown-related topics I found (none match my case):
- Cudnn.benchmark slowing execution down - the OP there included warmup time in the measurement; a second user likely ran into repeated re-benchmarking because the input shapes varied between calls (see the sketch below)
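For completeness, a minimal sketch of that second failure mode (the autotuner result is cached per input configuration, so every new shape pays the benchmarking cost again; shapes here are made up for illustration):

import torch
from torch.backends import cudnn

cudnn.benchmark = True
kernel = torch.rand(64, 3, 3, 3, device='cuda')

for h in (224, 225, 226):  # three distinct input shapes ...
    data = torch.rand(64, 3, h, h, device='cuda')
    _ = torch.nn.functional.conv2d(data, kernel)  # ... -> three separate benchmark passes
torch.cuda.synchronize()

That does not apply here, though: my script below uses a single fixed shape.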
System information:
- GPU: 1x NVIDIA L4
- Driver Version: 560.35.03
- Ubuntu 24.04 (Docker)
- Python 3.11.11
- PyTorch 2.5.1 (built from source):
- CUDA 12.6
- cuDNN 9.6.0.74-1
PyTorch details:
print(torch.__config__.show())
PyTorch built with:
- GCC 13.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2025.0.1-Product Build 20241031 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.6
- NVCC architecture flags: -gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_89,code=sm_89
- CuDNN 90.6
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.6, CUDNN_VERSION=9.6.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-O2 -pipe -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.0, USE_CUDA=1, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
profile.py
import torch
from torch.backends import cudnn

kernel = torch.rand(64, 3, 3, 3, device='cuda')

# Toggled between runs: False for the first output below, True for the second.
cudnn.benchmark = False
cudnn.benchmark_limit = 0

# Warmup: 50 untimed iterations.
torch.cuda.synchronize()
for _ in range(50):
    data = torch.rand(64, 3, 224, 224, device='cuda')
    _ = torch.nn.functional.conv2d(data, kernel)
torch.cuda.synchronize()

# Profiled run: 100 iterations.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
) as prof:
    for _ in range(100):
        data = torch.rand(64, 3, 224, 224, device='cuda')
        _ = torch.nn.functional.conv2d(data, kernel)

print(prof.key_averages().table(sort_by='self_cuda_time_total', row_limit=10))
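(A plain CUDA-event timing outside the profiler could serve as a cross-check that the gap is not profiler overhead; a sketch, reusing kernel from the script above:)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    data = torch.rand(64, 3, 224, 224, device='cuda')
    _ = torch.nn.functional.conv2d(data, kernel)
end.record()
torch.cuda.synchronize()
print(f'100 convolutions: {start.elapsed_time(end):.1f} ms')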
Output (cudnn.benchmark=False):
python cudnn_bench.py
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::cudnn_convolution 0.58% 2.199ms 0.92% 3.488ms 34.876us 362.741ms 95.90% 362.741ms 3.627ms 100
_5x_cudnn_ampere_scudnn_128x64_relu_xregs_large_nn_v... 0.00% 0.000us 0.00% 0.000us 0.000us 362.427ms 95.82% 362.427ms 3.624ms 100
aten::uniform_ 0.28% 1.048ms 0.50% 1.914ms 19.137us 15.509ms 4.10% 15.509ms 155.086us 100
void at::native::(anonymous namespace)::distribution... 0.00% 0.000us 0.00% 0.000us 0.000us 15.509ms 4.10% 15.509ms 155.086us 100
void cask__5x_cudnn::computeOffsetsKernel<false, fal... 0.00% 0.000us 0.00% 0.000us 0.000us 313.564us 0.08% 313.564us 3.136us 100
aten::rand 0.10% 382.634us 1.42% 5.401ms 54.012us 0.000us 0.00% 15.509ms 155.086us 100
aten::empty 0.82% 3.105ms 0.82% 3.105ms 31.049us 0.000us 0.00% 0.000us 0.000us 100
cudaStreamIsCapturing 0.03% 119.870us 0.03% 119.870us 0.599us 0.000us 0.00% 0.000us 0.000us 200
cudaLaunchKernel 0.52% 1.984ms 0.52% 1.984ms 6.615us 0.000us 0.00% 0.000us 0.000us 300
aten::conv2d 0.04% 134.365us 1.11% 4.220ms 42.204us 0.000us 0.00% 362.741ms 3.627ms 100
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 379.873ms
Self CUDA time total: 378.249ms
Output (cudnn.benchmark=True and cudnn.benchmark_limit=0):
python cudnn_bench.py
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::cudnn_convolution 0.40% 2.042ms 0.65% 3.291ms 32.912us 492.278ms 97.28% 492.278ms 4.923ms 100
_5x_cudnn_ampere_scudnn_128x128_relu_small_nn_v1 0.00% 0.000us 0.00% 0.000us 0.000us 491.961ms 97.22% 491.961ms 4.920ms 100
aten::uniform_ 0.20% 1.018ms 0.37% 1.885ms 18.845us 13.759ms 2.72% 13.759ms 137.587us 100
void at::native::(anonymous namespace)::distribution... 0.00% 0.000us 0.00% 0.000us 0.000us 13.759ms 2.72% 13.759ms 137.587us 100
void cask__5x_cudnn::computeOffsetsKernel<false, fal... 0.00% 0.000us 0.00% 0.000us 0.000us 317.248us 0.06% 317.248us 3.172us 100
aten::rand 0.06% 295.520us 1.04% 5.258ms 52.584us 0.000us 0.00% 13.759ms 137.587us 100
aten::empty 0.61% 3.078ms 0.61% 3.078ms 30.784us 0.000us 0.00% 0.000us 0.000us 100
cudaStreamIsCapturing 0.02% 118.718us 0.02% 118.718us 0.594us 0.000us 0.00% 0.000us 0.000us 200
cudaLaunchKernel 0.38% 1.945ms 0.38% 1.945ms 6.484us 0.000us 0.00% 0.000us 0.000us 300
aten::conv2d 0.02% 123.625us 0.79% 3.994ms 39.936us 0.000us 0.00% 492.278ms 4.923ms 100
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 507.724ms
Self CUDA time total: 506.037ms
Additional info:
I noticed that I get the fast algorithm in both cases if I add one more call of this convolution at the very beginning of the script, before setting cudnn.benchmark = True (sketch below).
Does the benchmarking itself perhaps also require some sort of warmup? That is, did the autotuner discard the first (and fastest) candidate because some benchmark-related setup "polluted" its timing?
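Concretely, this is the variant that gives me the fast 128x64 kernel in both modes:

import torch
from torch.backends import cudnn

kernel = torch.rand(64, 3, 3, 3, device='cuda')
data = torch.rand(64, 3, 224, 224, device='cuda')
_ = torch.nn.functional.conv2d(data, kernel)  # one extra call while benchmark is still False
torch.cuda.synchronize()

cudnn.benchmark = True        # enable the autotuner only after that first call
cudnn.benchmark_limit = 0
# ... warmup and profiled loops exactly as in profile.py above; the fast
# 128x64 kernel is now selected in both modes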