Smaller model results in more CUDA memory consumption

Hi there!

I am facing a strange behaviour from CUDA when using a handmade model. I have been using variants of ResNet for some time now, but since I am overfitting I decided to create a smaller model and see how it goes.
The thing is, when I use resnet18 (which has around 11M parameters) it consumes around 3 GB of CUDA memory (with batches of 64 images of size 224*224). However, when I use my custom model, which has around 4M parameters, it uses almost 10 GB of CUDA memory. My model is the following:

GradesClassifModel(
  (base_model): Sequential(
    (0): Sequential(
      (0): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(64, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (1): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (2): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(128, 256, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (3): Sequential(
        (0): ConvBnRelu(
          (conv): Conv2d(256, 512, kernel_size=(5, 5), stride=(1, 1))
          (bn): BatchNorm2d(512, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
        )
        (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
    )
  )
  (head): Sequential(
    (0): AdaptiveAvgPool2d(output_size=1)
    (1): Flatten()
    (2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.25, inplace=False)
    (4): Linear(in_features=512, out_features=512, bias=True)
    (5): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Dropout(p=0.5, inplace=False)
    (7): Linear(in_features=512, out_features=2, bias=True)
  )
)

Does anyone have an idea why this occurs? I have no clue whether this is normal due to some obscure autograd behaviour or whether I did something wrong.
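For reference, here is a minimal sketch of one way to read such numbers with torch.cuda.memory_allocated (the model and batch below are placeholders, not my actual setup):

import torch
import torch.nn as nn
from torchvision.models import resnet18

def mib(nbytes):
    return nbytes / 1024**2

device = torch.device("cuda")

# Placeholder model and batch: swap in the actual model and dataloader batch.
model = resnet18(num_classes=2).to(device)
images = torch.randn(64, 3, 224, 224, device=device)
targets = torch.randint(0, 2, (64,), device=device)
print(f"model + batch:  {mib(torch.cuda.memory_allocated()):.0f} MiB")

criterion = nn.CrossEntropyLoss()
loss = criterion(model(images), targets)
torch.cuda.synchronize()
print(f"after forward:  {mib(torch.cuda.memory_allocated()):.0f} MiB")

loss.backward()
torch.cuda.synchronize()
print(f"after backward: {mib(torch.cuda.memory_allocated()):.0f} MiB")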

Besides the parameters, the activations will consume a large portion of the memory.
Are you using fewer pooling layers (or other setups) in your custom model, which could yield larger output activations?
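One way to check is to register forward hooks and print each layer's output shape and size. A minimal sketch (the dummy model at the end is just for illustration):

import torch
import torch.nn as nn

def log_activation_sizes(model, sample_input):
    """Print the output shape and size (in MiB) of every leaf module."""
    handles = []

    def hook(module, inputs, output):
        if torch.is_tensor(output):
            size_mib = output.numel() * output.element_size() / 1024**2
            print(f"{module.__class__.__name__:>12}: {tuple(output.shape)} -> {size_mib:.1f} MiB")

    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(hook))

    with torch.no_grad():
        model(sample_input)

    for h in handles:
        h.remove()

# Dummy conv + pool stack standing in for the custom model:
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
log_activation_sizes(model, torch.randn(64, 3, 224, 224))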

Found the problem. In the ResNet I use there is no pooling layer, just strided convolutions for downsampling, while my small custom network uses max pooling. By removing the max pool layers and using strided convolutions instead (see the sketch after the numbers below), I managed to lower the memory usage to something acceptable. My experiment showed these memory consumptions:

  • resnet18:
    • 11M parameters
    • 855 MiB of memory with model and batch loaded only
    • 2755 MiB after forward pass
    • 3537 MiB after backward pass
  • small network with max pool:
    • 400K parameters
    • 811 MiB with model and batch only
    • 4329 MiB after forward
    • 5059 MiB after backward
  • small network with strided conv:
    • 400K parameters
    • 811 MiB with model and batch only
    • 1615 MiB after forward
    • 1755 MiB after backward
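The change amounts to roughly the following (a simplified sketch of the idea, not my exact code):

import torch.nn as nn

class ConvBnRelu(nn.Module):
    """Simplified version of the block from the model printout above."""
    def __init__(self, in_ch, out_ch, kernel_size=5, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, stride=stride)
        self.bn = nn.BatchNorm2d(out_ch, momentum=0.01)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Before: full-resolution conv followed by max pooling.
block_with_pool = nn.Sequential(
    ConvBnRelu(64, 128, kernel_size=5, stride=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# After: the convolution itself downsamples, so no full-resolution
# activation has to be kept around for the backward pass.
block_with_strided_conv = ConvBnRelu(64, 128, kernel_size=5, stride=2)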

The conclusion of all this is that max pooling takes a huge amount of memory. I am not sure whether this is expected behavior, as I don't know exactly how it affects the computation graph, but I will certainly not be using it in my experiments.

Pooling layers don't have any parameters, so check the activation shapes instead; they are most likely larger in your model that uses max pooling.
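For example (illustrative layer sizes):

import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2)
conv = nn.Conv2d(256, 512, kernel_size=5)

print(sum(p.numel() for p in pool.parameters()))  # 0: pooling has no parameters
print(sum(p.numel() for p in conv.parameters()))  # 512*256*5*5 + 512 = 3,277,312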

Well, I guess conv + max pool stores one more activation than a single strided convolution, and that extra activation is not downsampled, so each layer probably uses about 3 times more memory with max pooling than without, which is roughly the result I got. Thanks for the help!
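As a rough back-of-the-envelope check (shapes picked just for illustration):

import torch

# Illustrative shapes: a 64-image batch of 128-channel feature maps.
full_res = torch.zeros(64, 128, 104, 104)    # conv output before max pooling
downsampled = torch.zeros(64, 128, 52, 52)   # output after 2x downsampling

def mib(t):
    return t.numel() * t.element_size() / 1024**2

print(f"full-resolution activation: {mib(full_res):.0f} MiB")   # ~338 MiB
print(f"downsampled activation:     {mib(downsampled):.0f} MiB") # ~85 MiB

The full-resolution tensor is about 4 times the size of the downsampled one, so keeping even one extra undownsampled activation per block quickly dominates the per-block activation memory.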