What does torch.backends.cudnn.benchmark do?

Is it possible to interact with the cuDNN API from PyTorch? The following function returns the convolution algorithm to use, as selected by the CUDNN_CONVOLUTION_FWD_PREFER_FASTEST preference:

cudnnGetConvolutionForwardAlgorithm(cudnn,
                                    input_descriptor,
                                    kernel_descriptor,
                                    convolution_descriptor,
                                    output_descriptor,
                                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                    /*memoryLimitInBytes=*/0,
                                    &convolution_algorithm);

Or is it possible to use cuDNN logs somehow?

cc @ptrblck, who is more experienced with cuDNN 🙂

FYI, this issue is still open on GitHub.

Should this be enabled:

  • during training?
  • during inference?
  • or both?

I care only about inference speed.

Thank you

If your inputs always have the same size, you should enable it all the time.
But it only influences which algorithm cuDNN uses while the flag is enabled, so setting it during training does not influence inference in any way.
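
To make this concrete, here is a minimal sketch of enabling the flag for inference only; the model and input shape below are placeholders rather than anything from this thread:

    import torch

    # Assumption: a fixed input shape, so benchmarking pays off after the
    # first forward pass. The model and shapes are illustrative only.
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()
    x = torch.randn(8, 3, 224, 224, device="cuda")

    with torch.no_grad():
        model(x)  # first call: cuDNN profiles algorithms for this shape
        model(x)  # later calls: the cached fastest algorithm is reused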

@fmassa @ptrblck
Hello. I would like to ask a few questions about the behavior of torch.backends.cudnn.benchmark = True.

  1. Does the mini-batch size matter? Many people say that benchmarking uses the same cache if the image input size is the same. However, I have not found a clear explanation of whether changing the batch size is OK.
  2. How many caches can it manage? For example, I might have two types of input: 224x224 and 320x320. Would constantly switching between the two input sizes require additional benchmarking, or would there be two separate caches?

Thank you in advance for your replies!

  1. Yes, the batch size matters, as the ConvolutionParams will be stored from here.

  2. It's using a std::unordered_map keyed by the mentioned ConvolutionParams, so no additional benchmarking would be required for these two shapes once they are already profiled (see the sketch below).
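
As a rough way to observe this from Python (the timing loop below is illustrative, not from this thread): each distinct shape should be slow only on its first occurrence.

    import time
    import torch

    torch.backends.cudnn.benchmark = True
    conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()

    # 224 and 320 each trigger benchmarking once; the repeats hit the cache.
    for size in (224, 320, 224, 320):
        x = torch.randn(16, 3, size, size, device="cuda")
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.no_grad():
            conv(x)
        torch.cuda.synchronize()
        print(size, f"{(time.perf_counter() - t0) * 1e3:.1f} ms")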


Hello guys, I have a quick question about torch.backends.cudnn.benchmark = True.

When you say the input_size cannot change, does that apply to each convolution layer?

I have a UNet design using dense blocks. Since the input to each layer within a block is different, does that mean I cannot use torch.backends.cudnn.benchmark = True?
Is there any workaround for dense blocks so that I can use torch.backends.cudnn.benchmark = True?

Thanks in advance 🙂

The input shape can change, but each new input shape will rerun cudnnFind to find the fastest kernel for this shape (for all layers with a new input shape) and will add these kernels to a cache.

No, you can still use it, but each new input shape will cause a one-time slowdown while it is benchmarked.
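
If that one-time cost matters, a hedged workaround sketch (not something prescribed here) is to warm the cache up front with one dummy forward pass per input size you expect; the model and size list below are placeholders:

    import torch
    import torch.nn as nn

    torch.backends.cudnn.benchmark = True

    # Stand-in for a UNet/dense-block model; any nn.Module works the same way.
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1),
    ).cuda().eval()

    with torch.no_grad():
        # One warm-up pass per expected input size; the per-layer shapes that
        # follow from each size get benchmarked during these passes.
        for size in (224, 320):  # assumed set of input sizes
            model(torch.randn(1, 3, size, size, device="cuda"))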


How can we use it in C++?

at::globalContext().setBenchmarkCuDNN(true); should work.


What is the default value?

By default, benchmark is set to False.
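
A quick sanity check on your own install:

    import torch

    # Prints False on a stock install unless the flag was set earlier.
    print(torch.backends.cudnn.benchmark)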


What about using it on the CPU, can it still have a positive effect?

No, cuDNN is used on NVIDIA GPUs only.

Hello all, I would like to report/mention that I am experiencing out of memory issues when I am already tight on VRAM and then set torch.backends.cudnn.benchmark = True.

Profiling VRAM usage on smaller data shows that after setting torch.backends.cudnn.benchmark = True, there is a spike of VRAM usage at the beginning. After it drops, the overall footprint is still a bit higher than what I measure with torch.backends.cudnn.benchmark = False. To put this in numbers, peak VRAM usage is ~7GiB with False; peak with True is 19GiB, and it eventually settles at 10GiB. The execution time in my case is 25% shorter.
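
For reference, a rough sketch of one way to compare peak usage (the model and sizes below are placeholders, not the workload described above):

    import torch

    def peak_mib(benchmark_flag):
        # Peak memory seen by PyTorch's caching allocator, which should also
        # cover cuDNN workspaces requested through it.
        torch.backends.cudnn.benchmark = benchmark_flag
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
        x = torch.randn(32, 3, 512, 512, device="cuda")
        model(x).sum().backward()
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**20

    print("benchmark=False:", peak_mib(False), "MiB")
    print("benchmark=True: ", peak_mib(True), "MiB")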

Without further digging into this, I assume the increase in VRAM is the price that needs to be paid to get faster execution. Is that correct, or could the increased VRAM be avoided somehow? If not, it might be good to at least document it. Thanks!

In benchmark mode cuDNN will profile different algorithms with different memory requirements.
If the GPU is running out of memory, the error will be caught here, cleared, and the algorithms deselected. If a cuDNN algorithm was already selected and did not cause an OOM, the memory footprint could still increase depending on the algorithm, and the next operation might then run OOM. You can set the maximum allowed cuDNN workspace via CUDNN_CONV_WSCAP_DBG.
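
For example, the cap could be set before cuDNN is initialized; the 1024 MiB value below is arbitrary, not a recommendation:

    import os

    # CUDNN_CONV_WSCAP_DBG takes a value in MiB (per the cuDNN docs); set it
    # before torch/cuDNN are initialized.
    os.environ["CUDNN_CONV_WSCAP_DBG"] = "1024"

    import torch
    torch.backends.cudnn.benchmark = True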

Thanks for your reply. I tried setting the env variable CUDNN_CONV_WSCAP_DBG=38000 (my cuDNN version is 8.2.1.32). I am testing on an A100-40GB GPU, and the documentation says the value should be set in MiB. However, I still get OOM. Did you anticipate that limiting the cuDNN workspace would help?

I figured that benchmarking on a smaller input to find the best algorithm, and then directly setting that algorithm without further benchmarking, could help avoid the VRAM usage peak. But I did not find out how to do this even after digging further into this question and this git issue.

It depends on where the OOM was raised. cuDNN itself will not run OOM during its benchmarking, as explained before. However, reducing its memory usage could avoid future OOMs, depending of course on the workload.
You could set the env variable to 0 and check if you are still running OOM.