How to resolve “CUDA out of memory” on a low-complexity model (LBCNN)

I’m having some trouble implementing a Local Binary Convolutional Neural Network (LBCNN) in PyTorch.

This model uses fixed binary convolutional filters: they are randomly generated when the network is instantiated and stay the same ever after. The only parameters that get updated are the linear weights combining the feature maps produced by these filters (an operation implemented with 1x1 convolutions). The network is therefore expected to have a small number of learnable parameters and, hence, to be easy to train. But I am getting the following message when trying to train it:

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.90 GiB total capacity; 13.75 GiB already allocated; 7.75 MiB free; 15.09 GiB reserved in total by PyTorch)

Here is the code for the network:

import torch
import torch.nn as nn

class LBCBlock(nn.Module):
    def __init__(self, n_channels=384, n_kernels=512, sparsity=0.1):
        super().__init__()
        self.n_channels = n_channels
        self.n_kernels = n_kernels

        # fixed sparse binary filters, generated once and never updated
        self.conv_filter = nn.Conv2d(n_channels, n_kernels, kernel_size=3, padding=1, bias=False)
        kernels = torch.tensor(new_kernel(n_channels, n_kernels, sparsity)).type('torch.FloatTensor')
        kernels.requires_grad_()
        self.conv_filter.weight = nn.Parameter(kernels)
        self.conv_filter.weight.detach_()

        # learnable linear combination of the binary feature maps (1x1 convolutions)
        self.weighted_sum = nn.Conv2d(n_kernels, n_channels, kernel_size=1)

        # identity shortcut (residual connection)
        self.shortcut = nn.Sequential()

    def forward(self, x):
        out = torch.relu(self.conv_filter(x))
        out = self.weighted_sum(out)
        out += self.shortcut(x)

        return out

img_width = 32
img_height = 32
class NetLBC(nn.Module):
    def __init__(self, lbc_filters=512, n_channels=384, n_blocks=10, sparsity=0.1):
        super().__init__()
        self.n_channels = n_channels
        self.n_blocks = n_blocks

        self.conv1 = nn.Conv2d(3, n_channels, kernel_size=3, padding=1)
        self.lbc_blocks = nn.Sequential(*[LBCBlock(n_channels, lbc_filters, sparsity) for _ in range(n_blocks)])

        self.fc1 = nn.Linear(self.n_channels * (img_height//2) * (img_width//2), 384)
        self.fc2 = nn.Linear(384, 10)

    def forward(self, x):
        out = self.conv1(x)
        out = self.lbc_blocks(out)
        out = torch.nn.functional.max_pool2d(out, 2)
    
        out = out.view(-1, self.n_channels * (img_height//2) * (img_width//2))

        out = torch.relu(self.fc1(out))
        out = self.fc2(out)

        return out

model = NetLBC(lbc_filters=512, n_channels=384, n_blocks=50, sparsity=0.1).to(device=device)
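
In case it helps, new_kernel is omitted above; a simplified sketch of what it returns (sparse random ±1 entries, following the paper’s scheme):

import numpy as np

def new_kernel(n_channels, n_kernels, sparsity):
    # sparse random binary weights of shape (n_kernels, n_channels, 3, 3):
    # a fraction `sparsity` of the entries is +1 or -1 with equal probability,
    # the rest stays zero
    kernels = np.zeros((n_kernels, n_channels, 3, 3), dtype=np.float32)
    mask = np.random.rand(*kernels.shape) < sparsity
    kernels[mask] = np.random.choice([-1.0, 1.0], size=int(mask.sum()))
    return kernels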

Assuming the 1x1 convolutions you mention are implemented by the self.fc1 layer in NetLBC, I understand that none of the preceding layers should be updated.
If that’s the case, you could wrap the first part of the model in a with torch.no_grad() block to avoid storing the intermediate activations, which would otherwise be needed to compute the gradients.
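Something like this (a sketch, under the assumption that only fc1 and fc2 need gradients):

    def forward(self, x):
        # nothing inside this block is tracked by Autograd, so no
        # intermediate activations are stored for the backward pass
        with torch.no_grad():
            out = self.conv1(x)
            out = self.lbc_blocks(out)
            out = torch.nn.functional.max_pool2d(out, 2)
            out = out.view(-1, self.n_channels * (img_height//2) * (img_width//2))

        # only these layers build a graph and receive updates
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out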
Lowering the batch size would also decrease the memory usage and should help.

Thank you for the help!

Actually, the 1x1 convolutions are in the self.weighted_sum layer of LBCBlock. I’ve tried to follow your advice by changing the implementation of this block from

        self.conv_filter = nn.Conv2d(n_channels, n_kernels, kernel_size=3, padding=1, bias=False)
        kernels = torch.tensor(new_kernel(n_channels, n_kernels, sparsity)).type('torch.FloatTensor')
        kernels.requires_grad_()
        self.conv_filter.weight = nn.Parameter(kernels)
        self.conv_filter.weight.detach_()

        self.weighted_sum = nn.Conv2d(n_kernels, n_channels, kernel_size=1)

to

        with torch.no_grad():
            self.conv_filter = nn.Conv2d(n_channels, n_kernels, kernel_size=3, padding=1, bias=False)
            kernels = torch.tensor(new_kernel(n_channels, n_kernels, sparsity)).type('torch.FloatTensor')
            # kernels.requires_grad_()
            self.conv_filter.weight = nn.Parameter(kernels)
            # self.conv_filter.weight.detach_()

        self.weighted_sum = nn.Conv2d(n_kernels, n_channels, kernel_size=1)

and also by decreasing the batch size from 64 to 16. Nevertheless, the same error still occurs, which I take to mean that my model is still too large.

I keep thinking that something is not working the way I expect it to, since one of the main purposes of using an LBCNN is to have a model with fewer learnable parameters. (If you’re interested: [1608.06049] Local Binary Convolutional Neural Networks, https://arxiv.org/abs/1608.06049)
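
For reference, the number of trainable parameters can be checked with a quick count over model.parameters() (only tensors with requires_grad=True are updated by the optimizer):

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f'trainable: {n_trainable} / total: {n_total}')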

Even so, thank you again!

The torch.no_grad() guard should be used in the forward pass and should wrap all operations that Autograd does not need to track.
Could you add it to the forward method and check whether it lowers the memory usage?
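
In LBCBlock it could look like this (a sketch; everything inside the guard is detached from the graph):

    def forward(self, x):
        with torch.no_grad():
            # the fixed binary filters are never updated, so their
            # intermediate results don't need to be kept for backward
            out = torch.relu(self.conv_filter(x))
        out = self.weighted_sum(out)
        out += self.shortcut(x)
        return out

One caveat: since the guard detaches the conv path from the graph, earlier layers would only receive gradients through the shortcut connection.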