What does torch.backends.cudnn.benchmark do?

Is it possible to interact with the cuDNN API from PyTorch? The following function returns the convolution algorithm to use, as selected by the CUDNN_CONVOLUTION_FWD_PREFER_FASTEST preference:

cudnnGetConvolutionForwardAlgorithm(cudnn,
                                    input_descriptor,
                                    kernel_descriptor,
                                    convolution_descriptor,
                                    output_descriptor,
                                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                    /*memoryLimitInBytes=*/0,
                                    &convolution_algorithm);

Or is it possible to use cuDNN logs somehow?

cc @ptrblck, who is more experienced with cuDNN 🙂

FYI, this issue is still open on GitHub.

Should this be enabled:

  • during training?
  • during inference?
  • or both?

I care only about inference speed.

Thank you

If your inputs always have the same size, you should enable it all the time.
But it only influences which algorithm cuDNN uses while the flag is enabled, so setting it during training does not influence inference in any way.
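
To make this concrete, here is a minimal sketch of enabling the flag for inference only; the model and input shape below are placeholders rather than anything from this thread:

    import torch

    # Assumption: a fixed input shape, so benchmarking pays off after the
    # first forward pass. The model and shapes are illustrative only.
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()
    x = torch.randn(8, 3, 224, 224, device="cuda")

    with torch.no_grad():
        model(x)  # first call: cuDNN profiles algorithms for this shape
        model(x)  # later calls: the cached fastest algorithm is reused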

@fmassa @ptrblck
Hello. I would like to ask a few questions about the behavior of torch.backends.cudnn.benchmark = True.

  1. Does the mini-batch size matter? Many people say that benchmarking uses the same cache if the image input size is the same. However, I have not found a clear explanation of whether changing the batch size is OK.
  2. How many caches can it manage? For example, I might have two types of input: 224x224 and 320x320. Would constantly switching between the two input sizes require additional benchmarking, or would there be two separate caches?

Thank you in advance for your replies!

  1. Yes, the batch size matters, as the ConvolutionParams will be stored from here.

  2. It's using a std::unordered_map keyed by the mentioned ConvolutionParams, so no additional benchmarking would be required for these two shapes once they are already profiled (see the sketch below).
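
As a rough way to observe this from Python (the timing loop below is illustrative, not from this thread): each distinct shape should be slow only on its first occurrence.

    import time
    import torch

    torch.backends.cudnn.benchmark = True
    conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()

    # 224 and 320 each trigger benchmarking once; the repeats hit the cache.
    for size in (224, 320, 224, 320):
        x = torch.randn(16, 3, size, size, device="cuda")
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.no_grad():
            conv(x)
        torch.cuda.synchronize()
        print(size, f"{(time.perf_counter() - t0) * 1e3:.1f} ms")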


Hello guys, I have a quick question about torch.backends.cudnn.benchmark = True.

When you say the input_size cannot change, does that apply to each convolution layer?

I have a UNet design using dense blocks. Since the input to each layer within a block is different, does that mean I cannot use torch.backends.cudnn.benchmark = True?
Is there any workaround for dense blocks so that I can use torch.backends.cudnn.benchmark = True?

Thanks in advance 🙂

The input shape can change, but each new input shape will rerun cudnnFind to find the fastest kernel for this shape (for all layers with a new input shape) and will add these kernels to a cache.

No, you can still use it, but each new input shape will cause a one-time slowdown while it is benchmarked.
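
If that one-time cost matters, a hedged workaround sketch (not something prescribed here) is to warm the cache up front with one dummy forward pass per input size you expect; the model and size list below are placeholders:

    import torch
    import torch.nn as nn

    torch.backends.cudnn.benchmark = True

    # Stand-in for a UNet/dense-block model; any nn.Module works the same way.
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1),
    ).cuda().eval()

    with torch.no_grad():
        # One warm-up pass per expected input size; the per-layer shapes that
        # follow from each size get benchmarked during these passes.
        for size in (224, 320):  # assumed set of input sizes
            model(torch.randn(1, 3, size, size, device="cuda"))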


How can we use it in C++?

at::globalContext().setBenchmarkCuDNN(true); should work.


What is the default value?

By default, benchmark is set to False.
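
A quick sanity check on your own install:

    import torch

    # Prints False on a stock install unless the flag was set earlier.
    print(torch.backends.cudnn.benchmark)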


What about using it on the CPU, can it still have a positive effect?

No, cuDNN is used on NVIDIA GPUs only.

Hello all, I would like to report/mention that I am experiencing out of memory issues when I am already tight on VRAM and then set torch.backends.cudnn.benchmark = True.

Profiling VRAM usage on smaller data shows that after setting torch.backends.cudnn.benchmark = True, there is a spike of VRAM usage at the beginning. After it drops, the overall footprint is still a bit higher than what I measure with torch.backends.cudnn.benchmark = False. To put this in numbers, peak VRAM usage is ~7GiB with False; peak with True is 19GiB, and it eventually settles at 10GiB. The execution time in my case is 25% shorter.
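
For reference, a rough sketch of one way to compare peak usage (the model and sizes below are placeholders, not the workload described above):

    import torch

    def peak_mib(benchmark_flag):
        # Peak memory seen by PyTorch's caching allocator, which should also
        # cover cuDNN workspaces requested through it.
        torch.backends.cudnn.benchmark = benchmark_flag
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
        x = torch.randn(32, 3, 512, 512, device="cuda")
        model(x).sum().backward()
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**20

    print("benchmark=False:", peak_mib(False), "MiB")
    print("benchmark=True: ", peak_mib(True), "MiB")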

Without further digging into this, I assume the increase in VRAM is the price that needs to be paid to get faster execution. Is that correct, or could the increased VRAM be avoided somehow? If not, it might be good to at least document it. Thanks!

In benchmark mode cuDNN will profile different algorithms with different memory requirements.
If the GPU is running out of memory, the error will be caught here, cleared, and the algorithms deselected. If a cuDNN algorithm was already selected and did not cause an OOM, the memory footprint could still increase depending on the algorithm, and the next operation might then run OOM. You can set the maximum allowed cuDNN workspace via CUDNN_CONV_WSCAP_DBG.
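
For example, the cap could be set before cuDNN is initialized; the 1024 MiB value below is arbitrary, not a recommendation:

    import os

    # CUDNN_CONV_WSCAP_DBG takes a value in MiB (per the cuDNN docs); set it
    # before torch/cuDNN are initialized.
    os.environ["CUDNN_CONV_WSCAP_DBG"] = "1024"

    import torch
    torch.backends.cudnn.benchmark = True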

Thanks for your reply. I tried setting the env variable CUDNN_CONV_WSCAP_DBG=38000 (my cuDNN version is 8.2.1.32). I am testing on an A100-40GB GPU, and the documentation says the value should be set in MiB. However, I still get OOM. Did you anticipate that limiting the cuDNN workspace would help?

I figured that benchmarking on a smaller input to find the best algorithm, and then directly setting that algorithm without further benchmarking, could help avoid the VRAM usage peak. But I did not find out how to do this even after digging further into this question and this git issue.

It depends on where the OOM was raised. cuDNN itself will not run OOM during its benchmarking, as explained before. However, reducing its memory usage could avoid future OOMs, depending of course on the workload.
You could set the env variable to 0 and check if you are still running OOM.