8.7 GB CUDA block allocated and then freed by Conv2d forward

I’m training a DETR network with a ResNet-101 backbone, using the model & loss from https://github.com/facebookresearch/detr, but my own data loader and training code.

During training with a batch of just 2 images, I end up allocating and then freeing a gigantic 8.7 GB block, which then gets cached by PyTorch, quickly becomes fragmented, and thus becomes unfreeable. This leads to OOMs down the line.

I’ve isolated the allocation to the forward pass of a Conv2d in layer2 of the ResNet backbone, e.g. by adding the following to resnet.py in the torchvision package:

        s = torch.cuda.memory_snapshot()
        for b in s[0]["blocks"]:
            print("BEFORE %7.2f MB - %s" % (b["size"] / 1000000., b["state"]))

        out = self.conv2(out)
        s = torch.cuda.memory_snapshot()
        for b in s[0]["blocks"]:
            print("AFTER  %7.2f MB - %s" % (b["size"] / 1000000., b["state"]))

This prints the size and state of the blocks in the first memory segment. Example output:

BEFORE  117.96 MB - inactive
BEFORE   58.98 MB - active_allocated
BEFORE  886.31 MB - inactive
AFTER  8795.46 MB - inactive

For some reason, an 8.7 GB segment has been allocated and is sitting totally empty. Can someone help me understand why this is happening? The convolutions involved in ResNet-101 just aren’t that big: I looked at the tensor sizes, and it should be ~25 MB for both the input and output tensors on this particular layer.
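
(For reference, a float32 activation of shape (N, C, H, W) occupies N * C * H * W * 4 bytes, so e.g. the 2 x 128 x 90 x 160 input in the minimal example further down works out to about 14.7 MB.)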

Is PyTorch trying to be “clever” and preemptively allocate a chunk of memory for its cache? If so, what’s the logic that decides this?

This is causing me trouble downstream in the forward pass because I only have 11 GB on the GPU, and with so much memory pinned to “large” allocations, I eventually run out of memory for small allocations and fail.
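
For reference, besides torch.cuda.memory_snapshot() the coarser allocator counters are also handy for watching this while debugging; a small sketch:

import torch

# Bytes currently owned by live tensors vs. bytes reserved by the caching allocator.
print("allocated: %7.2f MB" % (torch.cuda.memory_allocated() / 1e6))
print("reserved:  %7.2f MB" % (torch.cuda.memory_reserved() / 1e6))

# Return cached, completely-free segments to the driver; this does not fix
# fragmentation inside a segment, but it shows how much memory is merely cached.
torch.cuda.empty_cache()
print("reserved after empty_cache: %7.2f MB" % (torch.cuda.memory_reserved() / 1e6))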

PyTorch should be able to reuse the memory, and I’m not sure whether avoiding the allocation would save you from the downstream OOM.
Anyway, the large block might be caused by benchmarking different cudnn kernels and allocating a workspace for them, if you are using torch.backends.cudnn.benchmark = True. Could you disable it and check whether the memory usage is reduced? torch.backends.cudnn.deterministic = True might also help, as it shouldn’t use any workspace, if I’m not mistaken.
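
I.e., something like this near the top of the training script (the deterministic flag is optional):

# Disable the cudnn autotuner, avoiding the benchmarking of multiple conv
# algorithms and the workspace allocations that come with it.
torch.backends.cudnn.benchmark = False

# Optionally force deterministic algorithms, which should avoid large workspaces.
torch.backends.cudnn.deterministic = True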

Thanks @ptrblck ! Setting benchmark to False avoids the large allocations, and allows me to complete the training without OOM.

I’d like to understand a bit more about the cuDNN benchmark memory requirements. How is the required workspace size determined?

Presumably I’m giving up some possible performance by not benchmarking. Do you have any feel for how big a difference this makes?

Thanks for the update!

It’s determined by the available algorithms and depends on the memory layout, data type, memory alignment, etc.
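
As a rough illustration (with an arbitrary conv, and numbers that vary a lot across GPUs and cuDNN versions), you can watch the reserved memory change just by switching the dtype or memory format:

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True

def peak_reserved_mb(dtype, channels_last):
    # Measure how much memory the caching allocator reserves for one forward
    # pass of the same conv under a different dtype / memory layout.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    conv = nn.Conv2d(128, 128, 3, padding=1).to("cuda:0", dtype)
    x = torch.randn((2, 128, 90, 160), device="cuda:0", dtype=dtype)
    if channels_last:
        conv = conv.to(memory_format=torch.channels_last)
        x = x.contiguous(memory_format=torch.channels_last)
    conv(x)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_reserved() / 1e6

print("fp32 NCHW : %7.2f MB" % peak_reserved_mb(torch.float32, False))
print("fp16 NHWC : %7.2f MB" % peak_reserved_mb(torch.float16, True))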

No, I don’t know how much performance would be lost, since benchmarking isn’t even working in your case, so there is nothing to compare against.
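
If you want a rough feel for the difference yourself, you could time a representative conv in two separate runs, once with benchmarking enabled and once without (separate processes, so the cached algorithm choice doesn’t carry over); a minimal sketch, with timings that will of course depend on the GPU, shapes, and cuDNN version:

import time

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True   # flip to False for the second run

conv = nn.Conv2d(128, 128, 3, padding=1).to("cuda:0")
x = torch.randn((2, 128, 90, 160), device="cuda:0")

# Warm-up: with benchmark=True the first call pays the algorithm-search cost.
for _ in range(10):
    conv(x)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    conv(x)
torch.cuda.synchronize()
print("avg forward: %.3f ms" % ((time.perf_counter() - start) / 100 * 1000))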

I think the right approach would be to limit the cudnn workspace size requirements and skip algos with a workspace requirement above a given threshold.
If you are using cudnn>=8.0.5, you could use this env variable as a workaround for now:

CUDNN_CONV_WSCAP_DBG=4096 python script.py args

Here, 4096 is specified in MiB (you can use a lower or higher value, if applicable).
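
If setting it on the command line is inconvenient, the same cap can presumably also be set from inside the script, as long as it happens before cuDNN is loaded (untested sketch):

import os

# Export the cap before importing torch, so the variable is already in the
# environment when libcudnn is loaded and the setting can take effect.
os.environ["CUDNN_CONV_WSCAP_DBG"] = "4096"   # cap in MiB

import torch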


That’s a helpful start! Do you know of any documentation that describes this in some quantitative detail?

I can’t help but feel like 8.7 GB is a crazy amount of memory to convolve a < 25MB tensor, even when running multiple algorithms, but I can’t determine whether it’s a bug or not without being able to understand what the workspace requirements are.

With the hint about cudnn.benchmark, I have a minimal example that reproduces the behavior, in case anyone is curious:

import torch
import torch.nn as nn


def pprint_snapshot():
    # Print active/total size for every segment held by the caching allocator,
    # plus the individual blocks inside each segment.
    s = torch.cuda.memory_snapshot()
    for seg in s:
        print("%7.2f | %7.2f MB - %s" % (
            seg["active_size"] / 1000000., seg["total_size"] / 1000000., seg["segment_type"]))
        for b in seg["blocks"]:
            print("    %7.2f MB - %s" % (b["size"] / 1000000., b["state"]))


# Enabling the cudnn autotuner is what triggers the huge workspace allocation.
torch.backends.cudnn.benchmark = True
conv = nn.Conv2d(128, 128, 3, padding=1).to("cuda:0")
x = torch.randn((2, 128, 90, 160), dtype=torch.float32).to("cuda:0")
pprint_snapshot()
output = conv(x)
print("output done")
pprint_snapshot()
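
For comparison, flipping torch.backends.cudnn.benchmark to False at the top should leave only small segments after the forward pass, consistent with what I saw in the full training run.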