GPU out of memory due to large memory allocation

Hi PyTorch community,

I have a Tesla V100-SXM2 with 32 GB of memory and I am hitting OOM errors when I try to run a convolution on a huge input tensor.
Here is a script that reproduces my issue:

import torch.nn as nn
import torch
module = nn.Conv2d(563, 3, kernel_size=3, stride=1, padding=1, device='cuda', dtype=torch.half)
W, H = 2048, 2048
input = torch.rand((1, 563, W, H), device='cuda', dtype=torch.half)
output = module(input)
print(torch.cuda.memory_summary())

I understand it is a very large tensor, but when I halve the tensor size to 1x563x2048x1024, PyTorch's memory summary reports a peak usage of 4824 MB. Shouldn't I be expecting around 9648 MB peak memory usage for 1x563x2048x2048? I also tried sizes like 1x563x1024x1024, 1x563x1024x512, and 1x563x512x512, and the peak memory vs. tensor size relationship is pretty linear there.
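
For reference, a loop along these lines (a sketch based on the script above) can reproduce the peak-memory measurements at the smaller sizes:

import torch
import torch.nn as nn

module = nn.Conv2d(563, 3, kernel_size=3, stride=1, padding=1, device='cuda', dtype=torch.half)

# Sizes listed above, from small to large; 2048x2048 is the one that OOMs.
for H, W in [(512, 512), (1024, 512), (1024, 1024), (2048, 1024)]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.rand((1, 563, H, W), device='cuda', dtype=torch.half)
    y = module(x)
    torch.cuda.synchronize()
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"1x563x{H}x{W}: peak allocated {peak_mb:.0f} MB")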
Here is the error I encountered:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 39.59 GiB (GPU 0; 31.75 GiB total capacity; 4.42 GiB already allocated; 26.53 GiB free; 4.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Why does PyTorch try to allocate so much memory? Is there a way to reduce the amount of memory PyTorch tries to allocate if it doesn't actually need that much?
Thank you!

You are correct that in theory the memory usage should scale linearly. However, you are crossing a support boundary of cuDNN between your examples: a 1x563x2048x1024 input has about 1.18e9 elements, which is smaller than INT_MAX, while a 1x563x2048x2048 input has about 2.36e9 elements, which is greater than INT_MAX. Since cuDNN currently doesn't support inputs with more than INT_MAX elements, this workload is dispatched to a native "im2col"-style implementation instead, which allocates much more memory in order to build the "col" tensor. We've requested support from cuDNN for these cases but don't have an estimated completion date yet.
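
As a rough back-of-the-envelope check (a sketch, not the exact bookkeeping the fallback does), the element counts and the size of the im2col "col" buffer for a 3x3 kernel line up with the numbers in your error message:

INT_MAX = 2**31 - 1  # 2_147_483_647

# Input element counts for the two sizes above
print(563 * 2048 * 1024)       # 1_180_696_576  < INT_MAX -> cuDNN path
print(563 * 2048 * 2048)       # 2_361_393_152  > INT_MAX -> falls back to im2col

# im2col "col" buffer: C_in * kH * kW * H_out * W_out elements
col_elems = 563 * 3 * 3 * 2048 * 2048
print(col_elems * 2 / 2**30)   # ~39.59 GiB in float16, matching the failed allocation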

See, e.g., this upstream issue for more details.

Thank you for your response. Do we now have a timeline for adding 64-bit indexing?