Depthwise Convolution Memory Leak

Hi, I have a memory-leak problem with the code below:

import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # channels, kernel size, padding, and output channels of the final conv
        self.c, self.k, self.p, self.x = 256, 3, 1, 2304
        self.n, self.h, self.w = 0, 0, 0
        self.unfold = nn.Unfold(self.k, padding=self.p)
        # depthwise 1x1 convolution: groups == in_channels == c * k * k
        self.conv = nn.Conv2d(self.c * self.k * self.k, self.x, 1, padding=0, groups=self.c * self.k * self.k)

    def forward(self, x: torch.Tensor):
        self.n, self.h, self.w = x.size(0), x.size(2), x.size(3)
        # view the input as (n, c, 1, h, w) so it broadcasts against the unfolded patches
        c1 = x.view(self.n, self.c, 1, self.h, self.w)

        # extract k*k patches per channel: (n, c*k*k, h*w) -> (n, c, k*k, h, w)
        c2 = self.unfold(x)
        c2 = c2.view(self.n, self.c, self.k * self.k, self.h, self.w)

        # add the input to every patch position, then flatten to (n, c*k*k, h, w) for the conv
        out = c1 + c2
        out = out.view(self.n, self.c * self.k * self.k, self.h, self.w)
        return self.conv(out)


if __name__ == '__main__':
    try:
        net = Net().cuda()
        x = torch.randn((1024, 256, 32, 32)).cuda()
        out = net(x)
        print(out)
    except Exception as ex:
        # memory still held at the moment the OOM is raised
        print(f' > CUDA memory: {torch.cuda.memory_allocated() / 1024 ** 3} GiB')
        del x, net
        print(ex)
    # expected to be 0 once the references are gone
    print(f' > CUDA memory: {torch.cuda.memory_allocated() / 1024 ** 3} GiB')

Here I define a network with a depthwise convolution as the last layer. I’m using a 24 GB RTX 3090, so once execution reaches self.conv(out) it will certainly run out of CUDA memory because of the large tensors. We can then catch the exception and delete the references to x and net; the CUDA memory reported after del x, net should be zero, but I get 9 GB instead, which means a depthwise convolution that fails to allocate its output leaks memory. If I change the group count of self.conv to something else, the CUDA memory correctly drops to zero.
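
For reference, this is the kind of change I mean. A minimal sketch, assuming the same Net as above; groups=self.c is just one example of a valid group count other than self.c * self.k * self.k:

# in Net.__init__, everything else unchanged
self.conv = nn.Conv2d(self.c * self.k * self.k, self.x, 1, padding=0, groups=self.c)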

With a different group count, I get the expected result:

 > CUDA memory: 19.000017166137695 GiB
CUDA out of memory. Tried to allocate 9.00 GiB (GPU 0; 23.69 GiB total capacity; 19.00 GiB already allocated; 1.48 GiB free; 19.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 > CUDA memory: 0.0 GiB

I think that these “hacks” might easily fail (as seen in your setup) and would thus recommend setting a proper batch size before starting the training.
In any case, you could check whether deleting out (if it’s even created) solves your issue.
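
To make that check concrete, here is a minimal sketch of what could be tried (my own variation, not verified on the 3090 setup above): keep an explicit reference to out, drop every reference after catching the OOM, and run gc.collect() plus torch.cuda.empty_cache() before reading the memory counters. It assumes the Net class defined earlier in this post.

import gc
import torch

if __name__ == '__main__':
    net = Net().cuda()
    x = torch.randn((1024, 256, 32, 32)).cuda()
    out = None  # so the name exists even if the forward pass raises
    try:
        out = net(x)
    except RuntimeError as ex:
        print(ex)
    # drop every reference we still hold, including a possibly never-created output
    del out, x, net
    gc.collect()              # release any lingering Python-side references
    torch.cuda.empty_cache()  # return cached blocks held by the caching allocator
    print(f' > allocated: {torch.cuda.memory_allocated() / 1024 ** 3} GiB')
    print(f' > reserved:  {torch.cuda.memory_reserved() / 1024 ** 3} GiB')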