smth
January 4, 2024, 3:31pm
@Zhang_Jiguo while you cannot set the cuDNN algorithm directly, you can limit the maximum workspace size that cuDNN is allowed to use when computing the convolution. That should be sufficient for your purpose.
You can set this workspace size using the environment variable CUDNN_CONV_WSCAP_DBG; for example, CUDNN_CONV_WSCAP_DBG=4096 caps the workspace at 4096 megabytes.
In your code, set this environment variable before the `import torch` line.
For example:
```python
import os

# The value must be a string; the unit is megabytes.
os.environ["CUDNN_CONV_WSCAP_DBG"] = "4096"

import torch
```
Alternatively, you can specify it on the command-line:
```
CUDNN_CONV_WSCAP_DBG=4096 python your_script.py
```
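Note that this variable is honored only by cuDNN 8.0.5 and newer (see the references below). You can check which cuDNN version your PyTorch build ships with via `torch.backends.cudnn.version()`:

```python
import torch

# Returns the cuDNN version as an integer (e.g. 8005 for cuDNN 8.0.5),
# or None if PyTorch was built without cuDNN.
print(torch.backends.cudnn.version())
```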
References:
Thanks for the update!
It’s defined by the available algorithm and depends on the memory layout, data type, memory alignment, etc.
No, I don’t know how much performance will be lost, as benchmarking is not even working.
I think the right approach would be to limit the cuDNN workspace size and skip algos whose workspace requirement exceeds the threshold.
If you are using cuDNN >= 8.0.5, you could use this env variable as a workaround for now:
CUDNN_CONV_WSCAP_DBG=4096 python script.py ar…
(GitHub issue opened 11 Dec 2020; labels: module: cudnn, module: convolution, triaged)
## 🚀 Feature
I am suggesting adding a mechanism to limit the workspace size used by cuDNN.
## Motivation
The workspace size for algorithms returned by the cuDNN heuristics can sometimes be large. As a result, users can hit CUDA OOM errors when using PyTorch with cuDNN; see, for example, https://discuss.pytorch.org/t/8-7-gb-cuda-block-allocated-and-then-freed-by-conv2d-forward/105478/5. Many of these OOM errors would be avoidable if PyTorch had a mechanism to limit the workspace size.
When doing convolution with cuDNN, PyTorch tries all algorithms in the order returned by the cuDNN heuristics and picks the first one that does not fail. This already filters out algorithms that require more memory than is free on the user's GPU, but OOM can still happen: for example, on a 40GB GPU, the first layer may use a 36GB workspace, and the second layer then fails with OOM because only 4GB is left on the device.
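To make that behavior concrete, here is a minimal Python sketch of the try-in-order selection with the workspace cap this issue proposes added as a filter. This is an illustration only, not PyTorch's actual (C++) implementation; `Algo` and the example values are made up:

```python
from dataclasses import dataclass

@dataclass
class Algo:
    name: str
    workspace_size: int  # bytes this algorithm needs

def pick_algo(algos, workspace_cap):
    # algos is assumed to be in the order returned by the cuDNN heuristics;
    # the first algorithm whose workspace fits under the cap wins.
    for algo in algos:
        if algo.workspace_size <= workspace_cap:
            return algo
    raise RuntimeError("no cuDNN algorithm fits under the workspace cap")

# With a 4 GiB cap, the 36 GiB algorithm from the example above is skipped.
algos = [Algo("fast", 36 * 2**30), Algo("frugal", 1 * 2**30)]
print(pick_algo(algos, 4 * 2**30).name)  # -> frugal
```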
This is not a great user experience; we should make PyTorch smarter at picking algorithms.
## Pitch
While I do believe adding a mechanism to limit the workspace size is worthwhile, I am not sure what the best approach is. Here I list all the options I can think of, and I want to discuss which is best. My personal choice would be either option 2 or option 3.
### Option 1: Hard code the limitation
We can hard code the limitation to a certain value, e.g. `4GiB`.
**Pro**: Easy to implement and maintain.
**Con**: Does not deliver the best user experience. A single hardcoded value won't automatically work on all devices, and it gives users no flexibility to adjust it. Also, if the input tensor is large, the hardcoded limit could make it impossible to find any valid algorithm to run it. (What if I have an 80GB GPU, want to do a conv on a 20GB input tensor, and am OK with using a 20GB workspace?)
### Option 2: Implement some heuristics
We can implement some heuristics to dynamically limit the workspace size. For example, we can make the limit something like
`limit = max(0.5 * input_size, 4GB)`.
**Pro**: A good heuristic could make PyTorch work automatically in most cases and deliver the best user experience.
**Con**: Validating a heuristic could be hard. Do we have enough test cases to validate it? Could the heuristic break code that currently works for some users?
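A tiny sketch of the example heuristic above; the 0.5 factor and the 4GB floor are just the illustrative values from this proposal:

```python
GIB = 2**30

def workspace_limit(input_size, alpha=0.5, floor=4 * GIB):
    # limit = max(0.5 * input_size, 4GB) from the example above
    return max(alpha * input_size, floor)

print(workspace_limit(20 * GIB) / GIB)  # 20 GiB input -> 10.0 (GiB)
```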
### Option 3: Single limitation for all cases adjustable by users
Similar to option 1, but we can introduce something like `torch.backends.cudnn.workspace_limit` to allow users to adjust this value.
**Pro**: Still easy to implement and maintain.
**Con**: Similar to option 1, a single limit value might not work with large inputs, and asking users to tune `workspace_limit` could confuse them.
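If this option were adopted, usage might look like the following; note that `torch.backends.cudnn.workspace_limit` is only the name proposed here and does not exist in PyTorch:

```python
import torch

# Proposed API, not an existing attribute: cap the cuDNN workspace at 4 GiB.
torch.backends.cudnn.workspace_limit = 4 * 2**30
```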
### Option 4: Heuristics with parameters adjustable by users
Similar to option 2, but we can introduce something like `torch.backends.cudnn.alpha` to allow users to adjust this value, with `limit = max(alpha * input_size, 1GB)`.
**Pro**: Users can adjust values to maximize their batch size.
**Con**: Can be very confusing, and hard to keep backward compatible.
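For completeness, a sketch of how the user-adjustable knob from this option would feed the heuristic; `alpha` is the proposed, hypothetical parameter:

```python
GIB = 2**30

def limit(input_size, alpha):
    # limit = max(alpha * input_size, 1GB), per option 4 above
    return max(alpha * input_size, 1 * GIB)

# A user lowering alpha to leave more memory for activations:
print(limit(20 * GIB, alpha=0.25) / GIB)  # -> 5.0 (GiB)
```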
## Alternatives
Starting from v8.0.5, cuDNN allows specifying the maximum workspace size via the `CUDNN_CONV_WSCAP_DBG` environment variable; see https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-805
cc @ptrblck @ngimel @csarofeen