How to fix a size mismatch in a pretrained model for large input image sizes?


So I understand that pretrained models WITH dense layers require the exact image size the network was originally trained on for input. I know you can feed in different image sizes provided you add additional layers but I was wondering what is the best/optimal way.

Currently, I have input sizes of 512 x 512 pixels for a pretrained densenet that takes in 224 x 224 pixels. I’m not keen on resizing or cropping the images before placing in a tensor.

At the moment my network is set up as follows: I extract features using the densenet and then pass them through a dense layer (denoted as classifier) for further training, with 2 outputs (binary classification). Here is a snippet of example code:

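import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DenseNetConv(torch.nn.Module):
    def __init__(self):
        super().__init__()  # must run before submodules are assigned
        original_model = models.densenet161(pretrained=True)
        # keep everything except the final classifier layer
        self.features = torch.nn.Sequential(*list(original_model.children())[:-1])
        # freeze the pretrained weights
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, x):
        x = self.features(x)
        #x = F.relu(x, inplace=True)
        x = F.avg_pool2d(x, kernel_size=7).view(x.size(0), -1)
        return x

densenet = DenseNetConv()

classifier = nn.Linear(2208, args.num_classes)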

# feature extraction and training

for i, (inputs, labels) in enumerate(dataloaders_dict['train']):

            inputs =
            labels =

            x = densenet(inputs)


            # Forward pass to get output/logits
            outputs =  classifier(x)
            # Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, labels)
            total_loss += loss.item()

            _, pred = torch.max(outputs, 1)
            equality_check = (labels == pred)
            train_accuracy += equality_check.type(torch.FloatTensor).mean()

            # Getting gradients w.r.t. parameters


I’m wondering whether it is just a case of adding adaptive pooling layers at the end. Would that result in much information loss? Are there other methods I should be aware of that allow me to feed in 512 x 512 pixel images “optimally”?

Pretrained models with dense layers do not require the exact image size the model was trained on, because global (adaptive) average pooling before the final dense layer is the default in many popular architectures (including DenseNet and ResNet). A fixed input size used to be necessary before this kind of pooling became popular, but even older models like the original AlexNet in torchvision have been “retrofitted” with adaptive average pooling to escape this limitation.

The reason resolution is important is how it changes the apparent scale of objects (in combination with other choices like the crop size) rather than some shape/architectural reason. In fact, it’s often better to evaluate the models at a slightly higher resolution (e.g., test @ >= 280x280 vs. train @ 224x224) than what they were trained on if the evaluation time crop is ~75% and random cropping was used at training time.

I’m not sure what the reason is behind the aversion to resizing or cropping at test time; consider how computation scales with resolution in CNNs (a conv at 1920x1080 would be roughly 41x more expensive than a conv at 224x224).
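
To make that scaling concrete, here is a rough back-of-the-envelope sketch. It assumes the cost of a convolution grows linearly with the number of spatial positions (same kernel, channels and stride) and ignores padding effects; the helper name `relative_conv_cost` is made up for illustration.

```python
def relative_conv_cost(h, w, base_h=224, base_w=224):
    """Ratio of spatial positions, used as a proxy for conv cost
    when kernel size, channels and stride are held fixed."""
    return (h * w) / (base_h * base_w)

full_hd = relative_conv_cost(1920, 1080)
square_512 = relative_conv_cost(512, 512)

print(f"1920x1080 vs 224x224: {full_hd:.1f}x")    # about 41x
print(f"512x512 vs 224x224: {square_512:.1f}x")   # about 5x
```

So feeding 512 x 512 images directly costs roughly five times as much per conv layer as 224 x 224, which may or may not be acceptable depending on the budget.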

Thanks for the reply, it’s very insightful. The higher resolution at test time is something I will investigate for future evaluation.

Nevertheless, I previously assumed that these networks do not accept any size with dense layers as a size mismatch error occurs when I extract features via the code snippet provided above and feed them into a dense layer:

RuntimeError: size mismatch, m1: [16 x 8832], m2: [2208 x 2]

Do I need to amend the last avg pool layer in the feature extraction class, or add additional layers to accept the feature extracted input size?

So the main difference here is that your feature extraction is using average pooling with a fixed kernel size, while the torchvision implementations of models such as DenseNet use adaptive average pooling (in this case also referred to as "global average pooling") to guarantee the output spatial dimensions are reduced to [1,1]. See the AdaptiveAvgPool2d page in the PyTorch 1.10 documentation.

Adaptive average pooling should fix the error that you’re seeing. Intuitively, it shouldn’t result in much information loss for a classification task since the final output has no spatial information.
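
The numbers in the error message can be traced with some shape arithmetic. The sketch below assumes that DenseNet-161's feature extractor downsamples by a total factor of 32 (which holds for input sizes divisible by 32) and that `F.avg_pool2d` runs with its defaults (stride equal to the kernel size, no padding, floor rounding). The helper names are hypothetical.

```python
def avg_pool_out(size, kernel, stride=None, padding=0):
    """Output length of avg_pool2d along one axis (ceil_mode=False)."""
    stride = kernel if stride is None else stride
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(image_size, channels=2208, downsample=32, kernel=7):
    """Feature count after fixed-kernel pooling and flattening."""
    fmap = image_size // downsample       # spatial size of the feature map
    pooled = avg_pool_out(fmap, kernel)   # spatial size after avg_pool2d
    return channels * pooled * pooled

print(flattened_features(224))  # 7x7 map -> 1x1 -> 2208
print(flattened_features(512))  # 16x16 map -> 2x2 -> 8832
```

With a 512 x 512 input the fixed 7 x 7 window leaves a 2 x 2 map, so flattening gives 2208 * 4 = 8832 features, which is the `m1: [16 x 8832]` in the error. Pooling adaptively to 1 x 1 keeps the flattened size at 2208 for any input size.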

So I tried passing a 512 x 512 pixel image into my above set up. I changed the last layer in the forward pass to:

def forward(self, x):
        x = self.features(x)
        #x = F.avg_pool2d(x, kernel_size=7).view(x.size(0), -1)
        x = nn.AdaptiveAvgPool2d((2208,2))
        return x

where (2208, 2) is the input for the classification layer.

(2208, 2) is what is normally output when feeding in an image size of 224 x 224 pixels.

However, with the newly added adaptive pooling layer, and feeding in an image size of 512 pixels, I get the following error:

AttributeError: 'AdaptiveAvgPool2d' object has no attribute 'dim'

I thought I could define the output in this adaptive pool layer, what am I doing wrong?

I’m a bit confused, it looks like you are defining the layer in the forward function when this is typically defined when the module is initialized.

For example, the init function could have something like

self.avgpool = nn.AdaptiveAvgPool2d((2208,2))

and forward would have

x = self.features(x)
x = self.avgpool(x)
return x

Ah of course, my bad!

However, I’m still confused with the output size defined in the adaptive pool layer.

I currently get a runtime size mismatch error

RuntimeError: size mismatch, m1: [78004224 x 2], m2: [2208 x 2]

when I apply

x = self.features(x)
x = self.avgpool(x)

as shown in your example and when using an image size of 512 x 512 pixels.

I printed the outputs from the first line extracting features and after adaptive pooling with the new code. This is what I get:

torch.Size([16, 2208, 7, 7]) # output from x =  self.features(x)

torch.Size([16, 2208, 2208, 2]) # using the new x = self.avgpool(x)

Previously, in the old example with average pooling used and an image size of 224 x 224 pixels, I would get:

torch.Size([2, 2208, 8, 8]) # x = self.features(x)
torch.Size([2, 2208]) # x = F.avg_pool2d(x, kernel_size=7).view(x.size(0), -1)

Am I missing a layer, or have I got the concept of output size completely wrong? My first guess would be that I need to reshape the adaptive pooling output as currently 16 * 2208 * 2208 gives me 78004224.

Adaptive average pooling operates on the spatial dimensions (H, W) of its input, so with an output size of (2208, 2) it returns a 2208 x 2 spatial map for every channel; the behavior you see is expected. Note that the channel dimension should not change depending on your input size in a typical CNN model, so you wouldn’t need to correct for this at inference time.

A standard approach would be to use AdaptiveAvgPool2d(1) to implement global average pooling, which should yield [N, 2208, 1, 1]. If you need [N, 2208] after this step, you can use a reshape operation.
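
As a minimal sketch of this, the snippet below applies global average pooling to random tensors standing in for DenseNet-161 feature maps. The 7 x 7 and 16 x 16 spatial sizes are assumed values for 224 and 512 pixel inputs, the `pool_and_flatten` helper is hypothetical, and `torch.flatten` plays the role of the reshape.

```python
import torch
import torch.nn as nn

# Global average pooling: the output is 1 x 1 for any input H x W.
gap = nn.AdaptiveAvgPool2d(1)

def pool_and_flatten(feature_map):
    pooled = gap(feature_map)         # [N, C, 1, 1]
    return torch.flatten(pooled, 1)   # [N, C]

small = torch.randn(2, 2208, 7, 7)    # stand-in for 224 x 224 inputs
large = torch.randn(2, 2208, 16, 16)  # stand-in for 512 x 512 inputs

out_small = pool_and_flatten(small)
out_large = pool_and_flatten(large)
print(out_small.shape, out_large.shape)

# Both outputs fit the same linear classifier.
classifier = nn.Linear(2208, 2)
logits = classifier(out_large)
print(logits.shape)
```

The argument to `AdaptiveAvgPool2d` is the target spatial size only; the 2208 channels pass through unchanged.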

Ah ok thanks for the tips.

Just out of curiosity, why should the channel dimension not change during inference?

I apply the same feature extraction process as defined above (basically untrained, using only ImageNet features) extracting only features and then feed this into a classifier layer (as shown above). I train only the classifier layer and use this during inference for validation/test. I don’t think I’m manipulating the channel dimensions… It seems to work fine…

On another note, I’ve applied the AdaptiveAvgPool2d(1) to get [N, 2208, 1, 1] and reshaped it at the end of the forward pass function to get [N, 2208] :

def forward(self, x):
    x = self.features(x)
    x = self.avgpool(x)         # [N, 2208, 1, 1]
    x = x.view(x.size(0), -1)   # [N, 2208]
    return x

All working now!

Consider what it means for the channel dimension to change during inference: the feature dimension of the model would be changing. This would be an unexpected transformation, as it corresponds to changing the “width” of a layer. It’s a nontrivial transformation, as described in the literature, and typically requires additional training/finetuning.

What you’ve described so far sounds OK, I was concerned about the channel dimensions being changed as 2208 appears to be the extent of the channel dimension, so it should not appear in a call to AdaptiveAvgPool2d.

Ah ok, I was under the impression I was just reshaping before feeding into a linear layer. In fact, since I know the output dimensions from feature extraction (torch.Size([4, 2208, 16, 16])) I believe AdaptiveAvgPool2d may not be a requirement…

I could have also just done:

x = F.avg_pool2d(x, kernel_size=16).view(x.size(0), -1)

which would also output torch.Size([4, 2208]), with 4 being the batch size… This would preserve the channel dimensions before being fed into the linear layer… I think…I’m guessing the 16 is the spatial dimensions and so reshaping wouldn’t do too much?
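
One way to check this guess is to compare the two poolings directly. The sketch below uses random tensors in place of real DenseNet features (an assumption for illustration) and also tries a 32 x 32 feature map to show what happens when the spatial size differs from the fixed kernel.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 2208, 16, 16)  # stand-in for features of 512 x 512 inputs

fixed = F.avg_pool2d(x, kernel_size=16).view(x.size(0), -1)
adaptive = F.adaptive_avg_pool2d(x, 1).view(x.size(0), -1)

# On a 16 x 16 map, a 16 x 16 window is a global average.
same = torch.allclose(fixed, adaptive, atol=1e-5)
print(fixed.shape, same)

# On a larger map, the fixed kernel no longer reduces to 1 x 1.
y = torch.randn(1, 2208, 32, 32)
fixed_y = F.avg_pool2d(y, kernel_size=16).view(y.size(0), -1)
adaptive_y = F.adaptive_avg_pool2d(y, 1).view(y.size(0), -1)
print(fixed_y.shape, adaptive_y.shape)
```

So a fixed `kernel_size=16` matches global average pooling only while the feature map stays 16 x 16; the adaptive version gives [N, 2208] for any input size.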