During training with a batch of just 2 images, I end up allocating and then freeing a gigantic 8.7 GB block, which then gets cached by PyTorch and quickly becomes fragmented and thus unfreeable. This leads to OOMs down the line.
I've isolated the allocation to the forward pass of a Conv2d in layer2 of the ResNet backbone, e.g. by adding the following instrumentation to resnet.py in the torchvision package:
s = torch.cuda.memory_snapshot()
for b in s[0]["blocks"]:
    print("BEFORE %7.2f MB - %s" % (b["size"] / 1000000., b["state"]))

out = self.conv2(out)

s = torch.cuda.memory_snapshot()
for b in s[0]["blocks"]:
    print("AFTER  %7.2f MB - %s" % (b["size"] / 1000000., b["state"]))
This prints the size and state of the blocks in the first memory segment. Example output:
BEFORE 117.96 MB - inactive
BEFORE 58.98 MB - active_allocated
BEFORE 886.31 MB - inactive
AFTER 8795.46 MB - inactive
For some reason, an 8.7 GB segment has been allocated and is sitting totally empty. Can someone help me understand why this is happening? The convolutions involved in ResNet-101 just aren't that big. I looked at the tensor sizes: the input and output tensors should each be ~25 MB on this particular layer.
Is PyTorch trying to be “clever” and preemptively allocate a chunk of memory for its cache? If so, what’s the logic that decides this?
This is causing me trouble downstream in the forward pass because I only have 11 GB on the GPU, and with so much memory pinned to "large" allocations, I eventually run out of memory for small allocations and fail.
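For anyone debugging something similar, the allocator's own counters make the cached-but-unused memory visible without walking the full snapshot. A minimal sketch (the tensor shape here is just illustrative; it only reports numbers on a machine with a CUDA device):

```python
import torch

def report(tag):
    # Bytes currently backing live tensors vs. bytes the caching
    # allocator holds in reserve; the difference is the cache.
    alloc = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    print("%s: %8.2f MB allocated, %8.2f MB reserved (cache: %8.2f MB)"
          % (tag, alloc / 1e6, reserved / 1e6, (reserved - alloc) / 1e6))

if torch.cuda.is_available():
    report("before")
    x = torch.randn(2, 512, 28, 28, device="cuda")
    report("after alloc")
    del x
    torch.cuda.empty_cache()  # return unused cached blocks to the driver
    report("after empty_cache")
```

Note that empty_cache() only releases whole unused segments, so it won't help with a segment that is fragmented by small live allocations.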
PyTorch should be able to reuse the memory, and I'm unsure whether avoiding the allocation would save you from the downstream OOM.
Anyway, the large block might be caused by testing different cudnn kernels and allocating a workspace for them, if you are using torch.backends.cudnn.benchmark = True. Could you disable it and check if the memory allocation is reduced? Setting cudnn.deterministic = True might also help, as it shouldn't use any workspace, if I'm not mistaken.
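Concretely, the flags can be set once at the top of the training script (the trade-off, as far as I know, is that disabling benchmark skips the autotuning pass, which may cost some throughput):

```python
import torch

# Disable the cudnn autotuner so no benchmarking workspaces are allocated.
torch.backends.cudnn.benchmark = False

# Optionally restrict cudnn to deterministic algorithms, which should
# further limit workspace use.
torch.backends.cudnn.deterministic = True
```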
It's determined by the available algorithms and depends on the memory layout, data type, memory alignment, etc.
No, I don’t know how much performance will be lost, as benchmarking is not even working.
I think the right approach would be to limit the cudnn workspace size requirements and skip algos whose workspace requirement exceeds a threshold.
If you are using cudnn>=8.0.5, you could use this env variable as a workaround for now:
CUDNN_CONV_WSCAP_DBG=4096 python script.py args
where the 4096 is specified in MiB (you can use a lower or higher value, as applicable).
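If editing the launch command is inconvenient, the same cap can presumably be set from inside the script, as long as it happens before cuDNN is initialized. A sketch (the 4096 MiB value mirrors the command above; pick whatever cap suits your GPU):

```python
import os

# Cap the cudnn conv workspace at 4096 MiB. The variable must be set
# before cuDNN is initialized, so this belongs at the very top of the
# script, before torch is imported or the first convolution runs.
os.environ["CUDNN_CONV_WSCAP_DBG"] = "4096"
```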
That’s a helpful start! Do you know of any documentation that describes this in some quantitative detail?
I can't help but feel like 8.7 GB is a crazy amount of memory to convolve a <25 MB tensor, even when trying multiple algorithms, but I can't determine whether it's a bug or not without understanding what the workspace requirements are.