F.conv2d with a large tensor: tried to allocate 81.00 GiB

Environment

Python 3.10.9
torch 2.0.1
GPU: Tesla P40

import torch
import torch.nn.functional as F
a = torch.randn(1, 256, 3072, 3072).cuda()  # the input alone needs ~9 GiB
b = torch.randn(256, 256, 3, 3).cuda()
c = torch.randn(256).cuda()
y = F.conv2d(a, b, c, (1, 1), (1, 1), (1, 1), 1)  # tries to allocate 81 GiB and OOMs

Questions:
Why does F.conv2d need 81 GiB of VRAM here?
Is there a way to compute the conv2d more slowly, without OOM, that doesn’t change the original result?
Thanks!

Hey!

The default algorithm needs a workspace proportional to the kernel size, which is massive here because of the number of channels.
One thing you should try is installing cuDNN on your machine and making sure it is available via torch.backends.cudnn.version(). IIRC it has at least one convolution algorithm that does not require any extra workspace.
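For example, you can check with the standard torch.backends.cudnn calls:

import torch

print(torch.backends.cudnn.is_available())  # True if a usable cuDNN was found
print(torch.backends.cudnn.version())       # e.g. 8700 for cuDNN 8.7, None if cuDNN is missing
print(torch.backends.cudnn.enabled)         # PyTorch uses cuDNN by default when it is available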

cuDNN is available on my machine;
how can I set the conv2d algorithm in PyTorch? Thanks!

@Zhang_Jiguo while you cannot set the cuDNN algorithm directly, you can limit the maximum workspace size that cuDNN is allowed to use to compute the convolution. That should be sufficient for your purpose.

You can set this workspace cap using the environment variable CUDNN_CONV_WSCAP_DBG, where the value (4096 in this example) is given in megabytes.
In your code, you can set this environment variable before importing torch.
For example:

import os
os.environ["CUDNN_CONV_WSCAP_DBG"] = "4096"  # must be a string; caps the cuDNN conv workspace at 4096 MB
import torch

Alternatively, you can specify it on the command-line:

CUDNN_CONV_WSCAP_DBG=4096 python your_script.py

References:

cuDNN won’t be used due to the large input size: needs_64bit_indexing_no_split will return true, and thus use_cudnn will return false, forcing the fallback to slow_conv2d_forward_cuda.


env CUDNN_CONV_WSCAP_DBG=4096 python test_conv.py # still tries to allocate 81 GiB

Can I set up slow_conv2d_forward_cuda to run without OOM?

Yes. As @ptrblck replied, the conv is so large that cuDNN doesn’t support it, hence it still runs via the slow_conv2d path, which takes a lot of working memory.
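As a rough sanity check on the numbers (two assumptions here, both my reading of the dispatch logic rather than something verified in the source: that the cuDNN cutoff is the 32-bit indexing limit, and that the slow path materializes an im2col buffer), the sizes line up with the error message:

# Assumption: cuDNN is skipped because the input exceeds 32-bit indexing
elements = 256 * 3072 * 3072                 # input numel with batch size 1
print(elements > 2**31 - 1)                  # True: 2415919104 elements

# Assumption: the slow path builds an im2col buffer of C * kH * kW * outH * outW floats
im2col_bytes = 256 * 3 * 3 * 3072 * 3072 * 4
print(im2col_bytes / 2**30)                  # 81.0 -> exactly the 81.00 GiB in the OOM message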

One way to solve this is to tile the convolution into patches.
Each individual patch of the convolution then ends up using cuDNN (and will be quite fast), and the overall computation is exactly the same.
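To get a feel for why the tiles are safe on the P40’s 24 GB (rough numbers, assuming tile_size = 256, so an interior tile grows to 258 x 258 with its 1-pixel halo):

tile_bytes = 1 * 256 * 258 * 258 * 4              # one float32 input tile
im2col_tile_bytes = 256 * 3 * 3 * 258 * 258 * 4   # worst-case im2col buffer for a single tile
print(tile_bytes / 2**20)                          # ~65 MiB
print(im2col_tile_bytes / 2**30)                   # ~0.57 GiB, comfortably within 24 GB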

Here’s some sample code that I verified works correctly (it gives the same output for F.conv2d and tiled_conv2d).

Here’s the tiled conv function

def tiled_conv2d(input, weight, bias, tile_size):
    """Compute the exact same result as F.conv2d (3x3 kernel, stride 1, padding 1),
    but do it tile by tile and account for border effects between tiles."""
    # Initialize the output tensor: stride 1 with padding 1 keeps the spatial size,
    # and the number of output channels comes from the number of filters
    y_full = torch.zeros(input.size(0), weight.size(0), input.size(2), input.size(3),
                         device=input.device, dtype=input.dtype)

    overlap = weight.size(2) - 1  # kernel size - 1, i.e. 1 extra pixel of context per side

    for i in range(0, input.shape[2], tile_size):
        for j in range(0, input.shape[3], tile_size):
            # Calculate the region of interest with overlap
            start_i = max(i - overlap // 2, 0)
            end_i = min(i + tile_size + overlap // 2, input.shape[2])
            start_j = max(j - overlap // 2, 0)
            end_j = min(j + tile_size + overlap // 2, input.shape[3])

            # Extract the tile
            tile = input[:, :, start_i:end_i, start_j:end_j]

            # Convolve the tile with the same stride/padding/dilation/groups as the full conv
            conv_tile = F.conv2d(tile, weight, bias, (1, 1), (1, 1), (1, 1), 1)

            # Determine the region in the output tensor to update
            # Adjust the placement considering the overlap
            output_start_i = i
            output_end_i = i + tile_size if i + tile_size <= input.shape[2] else input.shape[2]
            output_start_j = j
            output_end_j = j + tile_size if j + tile_size <= input.shape[3] else input.shape[3]

            # Adjust the slicing of the convolved tile to match the output size
            tile_i_start = 0 if i == 0 else overlap // 2
            tile_i_end = conv_tile.shape[2] - (0 if i + tile_size >= input.shape[2] else overlap // 2)
            tile_j_start = 0 if j == 0 else overlap // 2
            tile_j_end = conv_tile.shape[3] - (0 if j + tile_size >= input.shape[3] else overlap // 2)

            # Place the convolved tile in the output tensor
            y_full[:, :, output_start_i:output_end_i, output_start_j:output_end_j] = conv_tile[:, :, tile_i_start:tile_i_end, tile_j_start:tile_j_end]

    # y_full now contains the full convolved image
    return y_full

And here’s how to use it:

import torch
import torch.nn.functional as F

# Original image and kernel
input_size = 3072              # change this to a smaller size if you want to verify the correctness of F.conv2d  with tiled_conv2d
input = torch.randn(1, 256, input_size, input_size).cuda()
weight = torch.randn(256, 256, 3, 3).cuda()
bias = torch.randn(256).cuda()


tile_size = 256 # use 256 x 256 image patches
y_full = tiled_conv2d(input, weight, bias, tile_size)

# Use a smaller input_size and uncomment the next two lines to verify the correctness of tiled_conv2d
# y = F.conv2d(input, weight, bias, (1, 1), (1, 1), (1, 1), 1)  # OOMs at input_size = 3072
# print(torch.allclose(y, y_full))

Thanks, it works on the P40 within tolerance:

torch.allclose(y, y_full, rtol=1e-3, atol=1e-3) # True
torch.allclose(y, y_full) # False