Mobile-optimized pretrained models have far fewer parameters, but still require a lot of GPU memory?

Description

While experimenting with pretrained models as visual feature extractors, I was surprised to hit GPU OOM errors with the mobile-optimized models.
I profiled them with the simplified code below and got the results shown. MNASNet and MobileNetV2 have far fewer parameters than ResNet34,
but that does not translate into correspondingly lower GPU memory consumption. Is this reasonable or expected?

Example profiling

ResNet34Encoder: parameter#: 21,797,672
(0): mem_alloc: 1,772,397,056; max_mem_alloc: 1,842,586,624
(1): mem_alloc: 1,771,348,480; max_mem_alloc: 3,525,683,712
(2): mem_alloc: 1,772,397,056; max_mem_alloc: 3,525,683,712
(3): mem_alloc: 1,771,348,480; max_mem_alloc: 3,525,683,712
(4): mem_alloc: 1,772,397,056; max_mem_alloc: 3,525,683,712
MNASNet10Encoder: parameter#: 4,383,312
(0): mem_alloc: 2,359,172,096; max_mem_alloc: 4,131,510,784
(1): mem_alloc: 2,353,675,264; max_mem_alloc: 4,694,009,344
(2): mem_alloc: 2,354,584,576; max_mem_alloc: 4,694,009,344
(3): mem_alloc: 2,351,578,112; max_mem_alloc: 4,694,009,344
(4): mem_alloc: 2,358,516,736; max_mem_alloc: 4,694,009,344
MobileNetV2Encoder: parameter#: 3,504,872
(0): mem_alloc: 2,856,218,112; max_mem_alloc: 5,214,669,824
(1): mem_alloc: 2,857,602,560; max_mem_alloc: 5,699,356,160
(2): mem_alloc: 2,858,446,336; max_mem_alloc: 5,701,584,384
(3): mem_alloc: 2,851,966,464; max_mem_alloc: 5,701,584,384
(4): mem_alloc: 2,862,247,424; max_mem_alloc: 5,701,584,384
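
For scale, the parameter tensors alone account for only a small fraction of these figures. Here is a rough sketch (my own estimate, assuming float32 weights and using the parameter counts printed above):

param_counts = {
    'ResNet34Encoder': 21_797_672,
    'MNASNet10Encoder': 4_383_312,
    'MobileNetV2Encoder': 3_504_872,
}

BYTES_PER_FLOAT32 = 4
for name, n_params in param_counts.items():
    mib = n_params * BYTES_PER_FLOAT32 / 2**20
    print(f'{name}: parameter storage ~= {mib:.1f} MiB')

# ResNet34Encoder: parameter storage ~= 83.2 MiB
# MNASNet10Encoder: parameter storage ~= 16.7 MiB
# MobileNetV2Encoder: parameter storage ~= 13.4 MiB

So the weights themselves are well under 100 MiB per model, while mem_alloc above is 1.7-2.9 GB; the bulk of the allocation presumably comes from intermediate activations (kept for a potential backward pass, since the forward runs without no_grad) and the CUDA caching allocator, not from the parameters.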

Example code to reproduce

import torch
import torch.nn as nn
from torchvision.models import resnet34, mnasnet1_0, mobilenet_v2

class ResNet34Encoder(nn.Module):
    def __init__(self):
        super(ResNet34Encoder, self).__init__()
        self.feature_extractor = resnet34(pretrained=True)

    def forward(self, x):
        x = self.feature_extractor(x)
        return x


class MNASNet10Encoder(nn.Module):
    def __init__(self):
        super(MNASNet10Encoder, self).__init__()
        self.feature_extractor = mnasnet1_0(pretrained=True)

    def forward(self, x):
        x = self.feature_extractor(x)
        return x


class MobileNetV2Encoder(nn.Module):
    def __init__(self):
        super(MobileNetV2Encoder, self).__init__()
        self.feature_extractor = mobilenet_v2(pretrained=True)

    def forward(self, x):
        x = self.feature_extractor(x)
        return x


def main():
    nets = [ResNet34Encoder(), MNASNet10Encoder(), MobileNetV2Encoder()]
    for net in nets:
        # Start each model from a clean slate: release cached blocks and
        # reset the peak-memory counter before moving the model to the GPU.
        torch.cuda.empty_cache()
        torch.cuda.reset_max_memory_allocated()
        net.to('cuda')
        print(
            f'{net.__class__.__name__}: parameter#: {sum(p.numel() for p in net.parameters()):,}'
        )
        for n in range(5):
            x = torch.randn(10, 3, 512, 512)
            x = x.to('cuda')
            _ = net(x)
            print(
                f'({n}): mem_alloc: {torch.cuda.memory_allocated():,}; max_mem_alloc: {torch.cuda.max_memory_allocated():,}'
            )
        # Move the model back to the CPU so it does not stay resident on the
        # GPU and inflate the next model's measurements.
        net.to('cpu')

if __name__ == "__main__":
    main()

The old models are still on the GPU. Add net.to('cpu') at the end of the loop, before you measure the next model 🙂
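
To make the point concrete, here is a minimal sketch (using a single stand-in tensor instead of a whole model): anything still resident on the device keeps counting toward torch.cuda.memory_allocated(), so a leftover model inflates the next model's readings.

import torch

a = torch.randn(1000, 1000, device='cuda')  # stand-in for a model left on the GPU
print(torch.cuda.memory_allocated())        # roughly 4 MB: the stale tensor still counts

a = a.cpu()                                 # dropping the last CUDA reference frees it
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())        # back to (approximately) 0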

Nice catch! I have corrected the code above and updated the measurements accordingly. I also tried measuring each model independently (i.e. running the program three times, with only one model in main() per run, roughly as sketched below); the results are similar. Thanks.
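
A rough sketch of that single-model-per-run variant (reusing the encoder classes above; the command-line selection is my own simplification, not the exact script I ran):

import sys

def main_single():
    # One encoder per process, selected on the command line, so nothing from a
    # previously measured model can remain on the GPU.
    encoders = {
        'resnet34': ResNet34Encoder,
        'mnasnet1_0': MNASNet10Encoder,
        'mobilenet_v2': MobileNetV2Encoder,
    }
    net = encoders[sys.argv[1]]().to('cuda')
    print(f'{net.__class__.__name__}: parameter#: {sum(p.numel() for p in net.parameters()):,}')
    for n in range(5):
        x = torch.randn(10, 3, 512, 512).to('cuda')
        _ = net(x)
        print(f'({n}): mem_alloc: {torch.cuda.memory_allocated():,}; '
              f'max_mem_alloc: {torch.cuda.max_memory_allocated():,}')

# e.g. run as: python profile_encoder.py resnet34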
