I have an AlexNet-like neural network:
self.conv = nn.Sequential()
self.conv.add_module('conv1_s1',nn.Conv2d(3, 64, kernel_size=9, stride=2, padding=0))
self.conv.add_module('conv2_s1',nn.Conv2d(64, 96, kernel_size=5, padding=2, groups=2))
self.conv.add_module('conv3_s1',nn.Conv2d(96, 128, kernel_size=3, padding=1))
self.conv.add_module('conv4_s1',nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=2))
self.conv.add_module('conv5_s1',nn.Conv2d(128, 96, kernel_size=3, padding=1, groups=2))
self.fc6 = nn.Sequential()
The output of conv5_s1 is 96 channels of size 2x2 (see fc6_s1 after flattening). Does it make sense to have such a high number of channels for such a small spatial output of 2x2? It seems to me that the same feature map would be produced many times. Would 24 channels in layer conv5_s1 be sufficient, since 2x2 = 4 positions and 4 factorial = 24 possibilities? Or is this a wrong understanding?
Thanks for your help!
I must admit I don’t follow your reasoning.
My intuition would be more the following:
- In the input, you have largish spatial dimensions and only 3 channels. There isn’t much information per pixel in it.
- In the output (the classification), you don’t have any spatial dimensions but typically as many “channels” as you have classes - you might say you have 1x1 x classes. So there is much information per pixel - in the single pixel. But if you come from inputs like 200x200x3, the total size of 1x1x1000 is much lower.
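To put rough numbers on the two bullets, here is a small sketch in plain Python. The helper name `describe` is made up for illustration, and the 200x200x3 and 1x1x1000 sizes are just the example figures from above, not anything specific to your network.

```python
def describe(height, width, channels):
    """Return (values per pixel, total number of values) for an H x W x C tensor."""
    return channels, height * width * channels

# Example input image: large spatially, few channels.
input_per_pixel, input_total = describe(200, 200, 3)

# Example classifier output: a single "pixel" with one channel per class.
output_per_pixel, output_total = describe(1, 1, 1000)

print(input_per_pixel, input_total)    # 3 120000
print(output_per_pixel, output_total)  # 1000 1000
```

So the per-pixel information grows from 3 to 1000 values, while the total number of values shrinks from 120000 to 1000.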
The typical pattern in the inner layers of ResNet is to double the number of channels while halving the spatial resolution in both dimensions (height and width) - so a net halving of the dimensionality.
I’m not sure how you count “feature maps” here, maybe that is part of why it is complicated. I would take a simpler approach and think of dimensionalities of the intermediate layers.
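Thinking in dimensionalities could look like the following sketch. It is plain Python with a hypothetical helper `stage_shapes`, and the starting size 56x56x64 is only an illustrative assumption for a ResNet-style stage, not taken from your model.

```python
def stage_shapes(height, width, channels, num_stages):
    """List (H, W, C) after each stage that halves H and W and doubles C."""
    shapes = [(height, width, channels)]
    for _ in range(num_stages):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes

# Each stage quarters the spatial area and doubles the channels,
# so the total number of values per layer is halved.
for h, w, c in stage_shapes(56, 56, 64, 3):
    print(f"{h}x{w}x{c}: {h * w * c} values")

# The conv5_s1 output from the question: 2x2 spatial, 96 channels.
conv5_values = 2 * 2 * 96
print(conv5_values)  # 384
```

Seen this way, 96 channels at 2x2 is just a 384-dimensional representation before the fully connected layers, which is not unusually large.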
It was an error in my reasoning, indeed.
Thanks a lot for the clarification.