Given input size: (512x1x1). Calculated output size: (512x0x0). Output size is too small

Hello everyone,
I’m new to torch/PyTorch, and I’m currently trying to translate a script in Lua + torch into Python + PyTorch. The script in question implements a visual search model from a paper, and it can be found here.
The model that’s used is Caffe VGG16, but it’s loaded through torch.
Since it’s visual search, there are two different networks used: one for the stimuli (the image to be explored) and one for the target (the object to be found). They’re both very similar: the target model only has one extra layer (a MaxPool2d with kernel size 2x2 and stride 2). Both wrap VGG16 in an nn.Sequential container.
Here’s how they look:

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): ReLU(inplace=True)
  (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (25): ReLU(inplace=True)
  (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (27): ReLU(inplace=True)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)

In Python, I’m using torchvision to load the VGG16 model. Up to the point where the models are fed the input (line 105 in the Lua script linked above), I’ve successfully recreated every step. Everything is precisely the same: both the dimensions and the data.
However, I’ve encountered two different problems:
- I had to cast the tensors to 32-bit floats before feeding them to the networks, a step that’s not necessary in the Lua script (where they use 64-bit float tensors); see the short sketch after this list.
- The error described in the title.
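
For reference, here’s a minimal sketch of the casting step (my own, not the paper’s script); torchvision’s pretrained VGG16 stores its weights as 32-bit floats, so a 64-bit input raises a type mismatch error:

import torch

# hypothetical 64-bit input, mimicking the Lua script's DoubleTensor
img = torch.randn(1, 3, 224, 224, dtype=torch.float64)

# cast to 32-bit floats before the forward pass; .float() is shorthand for .to(torch.float32)
img = img.float()

(Alternatively, the whole model could be converted with model.double(), but casting the input seemed simpler.)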

Now, I know this error arises when you use an image that’s below the minimum size required by the model. And that’s precisely the case here: the stimuli model is fed a 224x224 image, which doesn’t raise any errors; the target model is fed a 28x28 image, which does.
The catch is: it doesn’t raise any errors in the Lua script, even though it’s a 28x28 image.
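
To make the size mismatch concrete, here’s a quick trace (my own sketch, not from the scripts) of how a 28x28 input shrinks through the target model’s five 2x2, stride-2 pooling layers with PyTorch’s default floor rounding:

size = 28
for pool in range(1, 6):
    # MaxPool2d(kernel_size=2, stride=2, ceil_mode=False) halves the size, rounding down
    size = size // 2
    print(f"after pool {pool}: {size}x{size}")
# 14x14 -> 7x7 -> 3x3 -> 1x1 -> 0x0, hence the "(512x0x0)" in the error message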

I’ve seen this thread, where ptrblck mentions that wrapping a model into an nn.Sequential container may cause some problems.
If anyone can give me a hint of why that’s so, I’d greatly appreciate it.

Here’s how I load the model and manipulate it (it’s not pretty):

import torchvision.models as models
import torch.nn as nn

# Load the model
model = models.vgg16(pretrained=True)
# Ignore first module since it's the net itself
layers = [module for module in model.modules()][1:]
stimuliLayers = list(layers[0])[:-1]
targetLayers = list(layers[0])
model_stimuli = nn.Sequential(*stimuliLayers)
model_target = nn.Sequential(*targetLayers)

As described in the linked post, the functional API calls would be dropped; in your case, the flatten operation would be missing.

Your error is raised because model.modules() recursively returns all modules. You can check it via print(list(model.modules())).
model.children() would yield the submodules in their order of initialization.
However, even this approach would assign the first nn.Sequential module (called .features in the original VGG16 model) to both model_stimuli and model_target, and would thus still raise the shape mismatch error.
Due to these shortcomings, I would recommend using nn.Sequential only for very simple modules, and not wrapping other models inside it without first checking the definition of the pretrained model.
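
For illustration, a quick way to see the difference:

import torchvision.models as models

model = models.vgg16(pretrained=True)

# modules() walks the module tree recursively: the VGG instance itself, the
# .features nn.Sequential, every Conv2d/ReLU/MaxPool2d inside it, .avgpool,
# the .classifier nn.Sequential, and all of its layers
print(sum(1 for _ in model.modules()))

# children() yields only the direct submodules, in their order of initialization
print([name for name, _ in model.named_children()])
# ['features', 'avgpool', 'classifier']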

Alright, thanks for the quick response! I’ll continue to look into it; there’s so much I need to learn.

Could you print the Lua models, which would show all modules? I’m still unsure how the small 28x28 input could be used in the feature extractor of the VGG model.

Sure.

This is how the model is loaded (having been previously downloaded from here and using a library for loading Caffe models in Torch7):

local cmodel = loadcaffe.load('Models/caffevgg16/VGG_ILSVRC_16_layers_deploy_old.prototxt', 'Models/caffevgg16/VGG_ILSVRC_16_layers_old.caffemodel', 'nn')
print(cmodel)

This is the output:

conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> output]
  (1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
  (2): nn.ReLU
  (3): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  (4): nn.ReLU
  (5): nn.SpatialMaxPooling(2x2, 2,2)
  (6): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
  (7): nn.ReLU
  (8): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
  (9): nn.ReLU
  (10): nn.SpatialMaxPooling(2x2, 2,2)
  (11): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
  (12): nn.ReLU
  (13): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (14): nn.ReLU
  (15): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (16): nn.ReLU
  (17): nn.SpatialMaxPooling(2x2, 2,2)
  (18): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
  (19): nn.ReLU
  (20): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (21): nn.ReLU
  (22): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (23): nn.ReLU
  (24): nn.SpatialMaxPooling(2x2, 2,2)
  (25): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (26): nn.ReLU
  (27): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (28): nn.ReLU
  (29): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (30): nn.ReLU
  (31): nn.SpatialMaxPooling(2x2, 2,2)
  (32): nn.View(-1)
  (33): nn.Linear(25088 -> 4096)
  (34): nn.ReLU
  (35): nn.Dropout(0.500000)
  (36): nn.Linear(4096 -> 4096)
  (37): nn.ReLU
  (38): nn.Dropout(0.500000)
  (39): nn.Linear(4096 -> 1000)
  (40): nn.SoftMax
}

Once loaded, model_stimuli and model_target are built in the following way:

model_stimuli = nn.Sequential()
model_target = nn.Sequential()

-- stimuli model: layers 1-30 (everything up to, but not including, the last pooling layer)
for i = 1, 30 do
	model_stimuli:add(cmodel:get(i))
end

-- target model: layers 1-31 (the same layers, plus the last pooling layer)
for i = 1, 31 do
	model_target:add(cmodel:get(i))
end

The output of print(model_target) is shown below (model_stimuli is precisely the same, minus layer #31):

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> output]
  (1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
  (2): nn.ReLU
  (3): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  (4): nn.ReLU
  (5): nn.SpatialMaxPooling(2x2, 2,2)
  (6): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
  (7): nn.ReLU
  (8): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
  (9): nn.ReLU
  (10): nn.SpatialMaxPooling(2x2, 2,2)
  (11): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
  (12): nn.ReLU
  (13): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (14): nn.ReLU
  (15): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (16): nn.ReLU
  (17): nn.SpatialMaxPooling(2x2, 2,2)
  (18): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
  (19): nn.ReLU
  (20): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (21): nn.ReLU
  (22): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (23): nn.ReLU
  (24): nn.SpatialMaxPooling(2x2, 2,2)
  (25): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (26): nn.ReLU
  (27): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (28): nn.ReLU
  (29): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (30): nn.ReLU
  (31): nn.SpatialMaxPooling(2x2, 2,2)
}

To be honest, I’m still unsure about Torch’s get method behaviour (I’m struggling with Torch’s documentation); at first glance, I’d have thought it was the same as doing the following in PyTorch:

model_stimuli = nn.Sequential(*list(model.features.children())[:-1])
model_target = nn.Sequential(*list(model.features.children()))

But, evidently, it’s not.
(I’ve searched for more topics on this matter, such as this one, and I’m currently trying different variations of it.)

That’s also what I understand from the Lua code. However, since the feature extractor is used in almost the same way as in the original VGG model (the last layer is removed), it should also yield empty activations in the original code given the small input shape.

Note that the model which uses 28x28 images as input has the same feature extractor as the VGG16 model (all 31 layers of it).

I’m thinking now that the difference may lie in how the model is loaded. Check loadcaffe’s third parameter, which indicates how to load the model. In this case, the value is ‘nn’ (the other possible values are ‘ccn2’ and ‘cudnn’):

local cmodel = loadcaffe.load('Models/caffevgg16/VGG_ILSVRC_16_layers_deploy_old.prototxt', 'Models/caffevgg16/VGG_ILSVRC_16_layers_old.caffemodel', 'nn')
print(cmodel)

And, indeed, the whole model is wrapped inside an nn.Sequential, even the classification layers (layers 32-40):

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> output]
  (1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
  (2): nn.ReLU
  (3): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  (4): nn.ReLU
  (5): nn.SpatialMaxPooling(2x2, 2,2)
  (6): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
  (7): nn.ReLU
  (8): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
  (9): nn.ReLU
  (10): nn.SpatialMaxPooling(2x2, 2,2)
  (11): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
  (12): nn.ReLU
  (13): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (14): nn.ReLU
  (15): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (16): nn.ReLU
  (17): nn.SpatialMaxPooling(2x2, 2,2)
  (18): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
  (19): nn.ReLU
  (20): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (21): nn.ReLU
  (22): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (23): nn.ReLU
  (24): nn.SpatialMaxPooling(2x2, 2,2)
  (25): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (26): nn.ReLU
  (27): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (28): nn.ReLU
  (29): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (30): nn.ReLU
  (31): nn.SpatialMaxPooling(2x2, 2,2)
  (32): nn.View(-1)
  (33): nn.Linear(25088 -> 4096)
  (34): nn.ReLU
  (35): nn.Dropout(0.500000)
  (36): nn.Linear(4096 -> 4096)
  (37): nn.ReLU
  (38): nn.Dropout(0.500000)
  (39): nn.Linear(4096 -> 1000)
  (40): nn.SoftMax
}

Whereas PyTorch prints the model in the following way:

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

Maybe you could shed some light on why this would make a difference, if it makes a difference at all. The only other difference I could find between the scripts was the casting of the input tensor to 32-bit, since the Caffe model accepts 64-bit tensors as input.

I’m not sure what you mean.
My argument is that the feature extractor should yield an empty activation in both frameworks, if the model is created correctly and all parameters are equal.
I guess Caffe/Lua might use a different rounding approach when calculating the output shape, which could yield a 1x1 activation in the end; you could check it manually in both frameworks.
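
On the PyTorch side, one way to test that guess would be to rebuild the 31-layer feature extractor with ceil_mode=True on the pooling layers (which, as far as I know, matches Caffe’s rounding) and check whether the 28x28 input then goes through. This is only an experiment, not the paper’s setup:

import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(pretrained=True)

# replace every pooling layer with a ceil_mode=True version (pooling has no
# weights, so nothing pretrained is lost)
layers = []
for layer in model.features.children():
    if isinstance(layer, nn.MaxPool2d):
        layer = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
    layers.append(layer)
model_target_ceil = nn.Sequential(*layers)

with torch.no_grad():
    out = model_target_ceil(torch.randn(1, 3, 28, 28))
print(out.shape)
# floor rounding: 28 -> 14 -> 7 -> 3 -> 1 -> error
# ceil rounding:  28 -> 14 -> 7 -> 4 -> 2 -> 1, i.e. torch.Size([1, 512, 1, 1])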

I’m unfamiliar with the paper, but I find it a bit strange that the last pooling layer is removed from the stimuli model, which gets the larger inputs, and not from the target model. Feeding the small input to the stimuli model works fine:

model_stimuli = nn.Sequential(*list(model.features.children())[:-1])
model_target = nn.Sequential(*list(model.features.children()))

out = model_stimuli(torch.randn(1, 3, 28, 28))
print(out.shape)
> torch.Size([1, 512, 1, 1])

You’re absolutely right, the model does work correctly if model_target leaves out the last pooling layer!
Apparently, it’s not a mistake in the code either, as this configuration is documented in the paper (I_s being the search image and I_t the target image).

Here’s Supplementary Figure 13, which shows that this configuration yielded the best results:
[image: visualsearchzeroshot-layersperformance]

Maybe the mistake is in resizing the target image to 28x28 instead of 224x224? Even though it’s clearly done in the script, I couldn’t find anywhere in the paper stating that the target was resized to 28x28.
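
For what it’s worth, a 224x224 target would pass through the full 31-layer feature extractor (last pooling layer included) without any issue; a quick self-contained check (my own sketch):

import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(pretrained=True)
model_target = nn.Sequential(*list(model.features.children()))  # all 31 layers

with torch.no_grad():
    out = model_target(torch.randn(1, 3, 224, 224))
print(out.shape)
# torch.Size([1, 512, 7, 7])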

Anyway, I can’t thank you enough for all the help you’ve given me. I marked your latest post as the solution.