1D convolution uses memory inefficiently with an extremely large input

I am trying to train a neural network on a very large input (5*100,000,000) and it requires much more memory than expected.
Here is a minimal example:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import time
    
    class Net(nn.Module):
    
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv1d(in_channels=5, out_channels=1, kernel_size=100000000, stride=10)
    
        def forward(self, x):
            x = self.conv1(x)
            x = torch.sigmoid(x)
            return x
    
    model = Net().cuda()
    
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = torch.nn.BCELoss()

    data = torch.normal(torch.zeros(1,5,100000000),torch.ones(1,5,100000000))
    data = data.cuda()
    label = torch.ones(1,1,1)
    label = label.cuda()
    
    for epoch in range(10):
        output = model(data)
        loss = criterion(output, label)
       
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print("Epoch :", epoch)

The input is random data; it uses approximately 2 GB, as expected (32 bits * 5 * 100,000,000 ≈ 1.86 GB). This tensor does not require a gradient.
The network consists of a single convolutional layer with one filter of the same size as the input, so it has 500M weights, which is another 2 GB.
After the forward pass another 2 GB are used.
After loss.backward() 8 GB are used, and after optimizer.step() 12 GB are used, which is all the available memory.
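
For reference, this is roughly how the per-step usage can be read off with torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated(); a sketch reusing model, data, criterion and optimizer from the example above (the exact figures depend somewhat on the caching allocator):

    def report(tag):
        # bytes currently allocated / peak bytes allocated on the current CUDA device
        alloc = torch.cuda.memory_allocated() / 2**30
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"{tag}: allocated {alloc:.2f} GB, peak {peak:.2f} GB")

    report("after moving data to the GPU")
    output = model(data)
    report("after forward")
    loss = criterion(output, label)
    optimizer.zero_grad()
    loss.backward()
    report("after backward")
    optimizer.step()
    report("after step")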

During the second epoch the forward pass runs fine, but during backpropagation I get RuntimeError: CUDA error: out of memory.

What exactly is kept in GPU memory during the epoch? Why is the memory not released after the optimization step is finished? How can I reduce memory usage in this case?

I’m not entirely sure, but I think this may be because PyTorch uses dynamic memory management, something similar to Python's automatic garbage collection, to manage and free up GPU memory. Since you have only a few tensors, each occupying a huge chunk of memory, the automatic garbage collection doesn’t kick in.
To make this work, you’ll need to manually delete the output and loss variables at the end of each epoch to free up GPU memory.
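
As a rough sketch, something like this at the end of each iteration (note that torch.cuda.empty_cache() only returns cached blocks to the driver; PyTorch can reuse freed memory without it):

    for epoch in range(10):
        output = model(data)
        loss = criterion(output, label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print("Epoch :", epoch)

        # drop the Python references that keep the output and the autograd graph alive,
        # then return the cached blocks to the driver
        del output, loss
        torch.cuda.empty_cache()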

That is a good idea. I tried deleting loss and output, but it had no effect.
The thing is that I don’t know what is using all the memory, so I don’t know what to delete.
The issue is similar to this one: How to free GPU memory? (and delete memory allocated variables), but there is no solution there.
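
One partial way to see what is still referenced is the usual gc trick for listing live CUDA tensors (it won’t show memory held internally by autograd or by the caching allocator, so it only gives part of the picture):

    import gc
    import torch

    # walk all Python objects and print the CUDA tensors among them
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                print(type(obj), tuple(obj.size()), obj.dtype)
        except Exception:
            pass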