Smaller model results in more CUDA memory consumption

Hi there!

I am facing a strange behaviour from CUDA when using a handmade model. I have been using variants of ResNet for some time now, but since I am overfitting I decided to create a smaller model and see how it goes.
The thing is, when I use resnet18 (which has around 11M parameters) it consumes around 3 GB of CUDA memory (with batches of 64 images of size 224*224). However, when I use my custom model, which has around 4M parameters, it uses almost 10 GB of CUDA memory. My model is the following:

GradesClassifModel(
  (base_model): Sequential(
    (0): Sequential(
      (0): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(64, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (1): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (2): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(128, 256, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (3): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(256, 512, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(512, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
    )
  )
  (head): Sequential(
    (0): AdaptiveAvgPool2d(output_size=1)
    (1): Flatten()
    (2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.25, inplace=False)
    (4): Linear(in_features=512, out_features=512, bias=True)
    (5): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Dropout(p=0.5, inplace=False)
    (7): Linear(in_features=512, out_features=2, bias=True)
  )
)

Does anyone have an idea why this occurs? I have no clue whether this is normal due to some obscure autograd behaviour or whether I did something wrong.
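For reference, here is a minimal sketch of one way to read such numbers with torch.cuda.memory_allocated (the model and batch below are placeholders, not my actual setup):

import torch
import torch.nn as nn
from torchvision.models import resnet18

def mib(nbytes):
    return nbytes / 1024**2

device = torch.device("cuda")

# Placeholder model and batch: swap in the actual model and dataloader batch.
model = resnet18(num_classes=2).to(device)
images = torch.randn(64, 3, 224, 224, device=device)
targets = torch.randint(0, 2, (64,), device=device)
print(f"model + batch:  {mib(torch.cuda.memory_allocated()):.0f} MiB")

criterion = nn.CrossEntropyLoss()
loss = criterion(model(images), targets)
torch.cuda.synchronize()
print(f"after forward:  {mib(torch.cuda.memory_allocated()):.0f} MiB")

loss.backward()
torch.cuda.synchronize()
print(f"after backward: {mib(torch.cuda.memory_allocated()):.0f} MiB")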

Besides the parameters, the activations will consume a large portion of the memory.
Are you using fewer pooling layers (or other setups) in your custom model, which could yield larger output activations?
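One way to check is to register forward hooks and print each layer's output shape and size. A minimal sketch (the dummy model at the end is just for illustration):

import torch
import torch.nn as nn

def log_activation_sizes(model, sample_input):
    """Print the output shape and size (in MiB) of every leaf module."""
    handles = []

    def hook(module, inputs, output):
        if torch.is_tensor(output):
            size_mib = output.numel() * output.element_size() / 1024**2
            print(f"{module.__class__.__name__:>12}: {tuple(output.shape)} -> {size_mib:.1f} MiB")

    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(hook))

    with torch.no_grad():
        model(sample_input)

    for h in handles:
        h.remove()

# Dummy conv + pool stack standing in for the custom model:
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
log_activation_sizes(model, torch.randn(64, 3, 224, 224))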

Found the problem. In the ResNet I use there is no pooling layer, just strided convolutions for downsampling, while my small custom network uses max pooling. By removing the max pool layers and using strided convolutions instead (see the sketch after the numbers below), I managed to lower the memory usage to something acceptable. My experiment showed these memory consumptions:

  • resnet18:
    • 11M parameters
    • 855 MiB of memory with model and batch loaded only
    • 2755 MiB after forward pass
    • 3537 MiB after backward pass
  • small network with max pool:
    • 400K parameters
    • 811 MiB with model and batch only
    • 4329 MiB after forward
    • 5059 MiB after backward
  • small network with strided conv:
    • 400K parameters
    • 811 MiB with model and batch only
    • 1615 MiB after forward
    • 1755 MiB after backward
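The change amounts to roughly the following (a simplified sketch of the idea, not my exact code):

import torch.nn as nn

class ConvBnRelu(nn.Module):
    """Simplified version of the block from the model printout above."""
    def __init__(self, in_ch, out_ch, kernel_size=5, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, stride=stride)
        self.bn = nn.BatchNorm2d(out_ch, momentum=0.01)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Before: full-resolution conv followed by max pooling.
block_with_pool = nn.Sequential(
    ConvBnRelu(64, 128, kernel_size=5, stride=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# After: the convolution itself downsamples, so no full-resolution
# activation has to be kept around for the backward pass.
block_with_strided_conv = ConvBnRelu(64, 128, kernel_size=5, stride=2)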

The conclusion of all this is that max pooling takes a huge amount of memory. I am not sure whether this is expected behavior, as I don't know exactly how it affects the computation graph, but I will certainly not be using it in my experiments.

Pooling layers don't have any parameters, so check the activation shapes instead; they are most likely larger in your model that uses max pooling.
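For example (illustrative layer sizes):

import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2)
conv = nn.Conv2d(256, 512, kernel_size=5)

print(sum(p.numel() for p in pool.parameters()))  # 0: pooling has no parameters
print(sum(p.numel() for p in conv.parameters()))  # 512*256*5*5 + 512 = 3,277,312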

Well, I guess conv + max pool stores one more activation than a single strided convolution, and that extra activation is not downsampled, so each layer probably uses about 3 times more memory with max pooling than without, which is roughly the result I got. Thanks for the help!
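As a rough back-of-the-envelope check (shapes picked just for illustration):

import torch

# Illustrative shapes: a 64-image batch of 128-channel feature maps.
full_res = torch.zeros(64, 128, 104, 104)    # conv output before max pooling
downsampled = torch.zeros(64, 128, 52, 52)   # output after 2x downsampling

def mib(t):
    return t.numel() * t.element_size() / 1024**2

print(f"full-resolution activation: {mib(full_res):.0f} MiB")   # ~338 MiB
print(f"downsampled activation:     {mib(downsampled):.0f} MiB") # ~85 MiB

The full-resolution tensor is about 4 times the size of the downsampled one, so keeping even one extra undownsampled activation per block quickly dominates the per-block activation memory.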