Is there an error in all VGG architectures?

Hello. I was trying to understand the architecture of VGG19_bn and I stumbled upon something interesting.
After the convolution layers, there is an adaptive average pooling layer with output size (7, 7). My understanding is that 7x7 is the requested output size of the average pooling. However, the feature maps that feed into the average pooling are already n x 7 x 7, meaning that each average is taken over a single element (the average of a 1x1 window is the element itself). I checked by stepping through the code that the output of the average pooling is exactly the same as its input, both in terms of size and in terms of values.
Why is the average pooling even there, since it’s not doing anything? Is this a mistake?

The adaptive pooling layers were added to relax the condition on specific input shapes.
By default a lot of models are trained on inputs of shape [batch_size, 3, 224, 224].
As you said, for this input shape the adaptive pooling layer does nothing and the model behaves like the original architecture.
However, if you would like to increase the spatial size to e.g. 350x350 pixels, the original architecture would run into a shape mismatch error at the first linear layer, and you would have to resize the image to 224x224.
To relax this condition and make these models more flexible, the adaptive pooling layers were added. They always return the desired output shape, so that the number of features in the flattened activation matches the specified in_features of the following linear layer.
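To make this concrete, here is a minimal sketch using random tensors in place of real VGG feature maps. It assumes the usual VGG layout of five 2x2 max pools with stride 2 and 512 output channels, so a 224x224 input ends up as 512 x 7 x 7 and a 350x350 input as 512 x 10 x 10 (350 -> 175 -> 87 -> 43 -> 21 -> 10). The variable names are made up for illustration.

```python
import torch
import torch.nn as nn

# Same layer type that sits between the conv stack and the classifier.
pool = nn.AdaptiveAvgPool2d((7, 7))

# Stand-in for the conv output of a 224x224 image: already 7x7,
# so every pooling window covers exactly one element.
feat_224 = torch.randn(1, 512, 7, 7)
out_224 = pool(feat_224)
is_identity = torch.allclose(out_224, feat_224)

# Stand-in for the conv output of a 350x350 image: 10x10 spatially.
feat_350 = torch.randn(1, 512, 10, 10)
out_350 = pool(feat_350)

# Flattened size that the first linear layer would receive.
flat_with_pool = out_350.flatten(1).shape[1]      # 512 * 7 * 7
flat_without_pool = feat_350.flatten(1).shape[1]  # 512 * 10 * 10

print(is_identity, tuple(out_350.shape), flat_with_pool, flat_without_pool)
```

Without the pooling layer, the 350x350 case would hand 51200 features to a linear layer that expects 25088, which is the shape mismatch mentioned above.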

Got it! Thanks! While we are at it, let me ask another question. I really like the ‘adaptive’ feature of the average pooling. Is there anything similar for the conv layer, where we specify the size of the output feature map and the kernel size is computed automatically? Thanks.

Unfortunately, there isn’t such a method for conv layers, as you would have to manipulate the trainable parameters for each different input shape.
Since pooling layers do not have trainable parameters, it’s not problematic to change the kernel shape and setup.
However, the filters in conv layers have a fixed shape that is part of the trained weights. If you would like to increase their spatial size, you would have to initialize the new elements somehow (or use another method such as cloning/mirroring certain values).
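A small sketch of the difference, with arbitrary channel counts chosen only for illustration: the adaptive pooling layer has no parameters at all, while the conv layer stores a weight tensor whose kernel size is fixed when the layer is created. The conv layer still accepts different input sizes, but its output size follows the input instead of being pinned to a target.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((7, 7))
conv = nn.Conv2d(8, 16, kernel_size=3, padding=1)  # channel counts are arbitrary

# Pooling has nothing to train, so its window can change freely per input.
num_pool_params = sum(p.numel() for p in pool.parameters())

# The conv kernel is a trainable tensor: [out_channels, in_channels, kH, kW].
weight_shape = tuple(conv.weight.shape)

# Same kernel, different input sizes: the output size tracks the input.
out_small = conv(torch.randn(1, 8, 7, 7))
out_large = conv(torch.randn(1, 8, 10, 10))

print(num_pool_params, weight_shape)
print(tuple(out_small.shape), tuple(out_large.shape))
```

If a fixed output size is needed after a conv layer, one option is to follow it with an adaptive pooling layer, which is effectively what the VGG models do.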
