CUDA out of memory when modifying model class

import torch.nn as nn

cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

class VGG(nn.Module):

    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.cfg = cfg[vgg_name]
        self.teacher = self._make_layers()
        self.pool = nn.AvgPool2d(kernel_size=1, stride=1)
        self.linear = nn.Linear(512, 3)  # change last layer to 3

    def forward(self, x):
        teacher_counter = 0
        student_features = []
        feature = x
        for block in self.cfg:
            if block == 'M':
                feature = self.teacher[teacher_counter](feature)
                student_features.append(feature)
                teacher_counter += 1

        out = self.pool(feature)
        out = out.view(out.size(0), -1)
        out = self.linear(feature)

        print(len(student_features))
        return out, student_features


    def _make_layers(self):
        layers = []
        teacher = []
        in_channels = 3
        for x in self.cfg:
            if x == 'M':
                teacher.append(nn.Sequential(*layers).to("cuda"))
                layers = []
            else:
                layers += [
                    nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                    nn.BatchNorm2d(x),
                    nn.ReLU(inplace=True)
                ]
                in_channels = x

        return teacher

The goal is to return the final feature of VGG16 and to collect a feature in student_features for each 'M' block. Could anyone let me know if anything is wrong with the above implementation? I now get a "CUDA out of memory" error, even though the original VGG-16 works perfectly under the same batch size.

I would recommend comparing your custom VGG model against the torchvision implementation on the CPU first and checking where the difference might be coming from.
You could start with the general model architecture by checking all layers, and then continue with the intermediate activations etc.
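
As a rough sketch of such a comparison (this assumes the VGG class above is in scope, that the .to("cuda") call in _make_layers is removed or made optional so the model can be built on the CPU, and uses torchvision's vgg16_bn as the reference):

from torchvision import models

# build both models on the CPU and compare them side by side
custom = VGG('VGG16')
reference = models.vgg16_bn()  # torchvision's VGG16 with batch norm

# compare the printed architectures layer by layer
print(custom)
print(reference)

# illustrative helper: compare the number of trainable parameters
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print('custom params:   ', count_params(custom))
print('reference params:', count_params(reference))

A mismatch here (in the listed layers or the parameter counts) would point to the part of the custom model that differs from the reference.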

Hi @ptrblck, thanks for your advice! Could you confirm whether my implementation is correct for getting the final feature and a feature for every 'M' layer/block?

I cannot verify it, as your code is not executable, but you could run it and check the returned student_features to see whether the stored features are all different.
I would guess that student_features.append(feature) might append a reference to feature and, since feature is overwritten in each iteration, the list could end up holding the same reference multiple times.
In that case, use student_features.append(feature.clone()) and it should work.
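
As a rough illustration of that check (using a toy list as a stand-in for the student_features returned by your model):

import torch

# toy stand-in for the returned student_features list
student_features = [torch.randn(2, 64, 16, 16), torch.randn(2, 128, 8, 8)]

# if two entries report the same data_ptr(), they share the same underlying
# storage, and a later in-place update would change the "earlier" entry too
for i, feat in enumerate(student_features):
    print(i, tuple(feat.shape), feat.data_ptr())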

Hi @ptrblck, thanks for the input!

It appears that CUDA runs out of memory after feature = self.teacher[teacher_counter](feature) has been called 3 times (it should be called 5 times). Does calling an nn.Sequential many times fill up the memory (even if each one is only a small part of the model)? Is there a way to release the memory?

No, the nn.Sequential container will not use more memory than calling the layers manually.
Note that you are storing additional features in the forward pass (along with their computation graph), which will of course increase the memory usage, so the OOM might be expected.
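
If you don't need gradients to flow back through the stored features (for example, if they are only used as detached targets later on), detaching them before appending would avoid keeping the stored copies attached to the autograd graph. A minimal sketch with a toy block rather than your actual model:

import torch
import torch.nn as nn

# toy block standing in for one of the teacher sub-networks
block = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
x = torch.randn(2, 3, 32, 32)

feature = block(x)
stored = []

# detach() stores the values only; the stored copy has no grad_fn, so keeping
# it around does not keep the computation graph reachable through the list
stored.append(feature.detach())

print(feature.grad_fn)    # e.g. <ReluBackward0 object ...>
print(stored[0].grad_fn)  # None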