I am trying to use global average pooling, but I have no idea how to implement it in PyTorch. Global average pooling is briefly described as:

It means that if you have a 3D 8,8,128 tensor at the end of your last convolution, in the traditional method, you flatten it into a 1D vector of size 8x8x128. And you then add one or several fully connected layers and then at the end, a softmax layer that reduces the size to 10 classification categories and applies the softmax operator.

The global average pooling means that you have a 3D 8,8,10 tensor and compute the average over the 8,8 slices, you end up with a 3D tensor of shape 1,1,10 that you reshape into a 1D vector of shape 10. And then you add a softmax operator without any operation in between. The tensor before the average pooling is supposed to have as many channels as your model has classification categories.

tensor = self.Conv2d(output_size=10, kernel_size=1)  # to get [10x8x8] size
tensor = self.GlobalAvgPooling(tensor)  # whatever this is, to get [10, 1, 1]
tensor = self.Squeeze_Dims(tensor)  # to just get a vector [10]
tensor = self.Softmax(tensor)

Here are the questions:

Are the above examples correct, keeping in mind the description of global average pooling?

How can I do the global average pooling? Should I use the functional module?

The paper I am trying to reproduce (residual nets) says that:

The network ends with a global average pooling, a 10-way fully-connected layer, and softmax.

But this does not make sense to me. Why do they need the 10-way fc layer?

Global average pooling means that you average each feature map separately. In your case if the feature map is of dimension 8 x 8, you average each and obtain a single value. The important part here is that you do the average operation per-channel. You can think of each of the feature maps as the final feature representation per category over which you want to do classification.

To do this you can apply either nn.AvgPool2d or F.avg_pool2d with kernel_size equal to the dimensions of the feature maps (in this case, 8).
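As a minimal sketch of both options, assuming a hypothetical batch of 4 samples with 10 channels and 8 x 8 feature maps (the tensor and its sizes are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical input: N=4 samples, C=10 channels, 8x8 feature maps
x = torch.randn(4, 10, 8, 8)

# Module form: the pooling window covers the whole 8x8 map
pool = nn.AvgPool2d(kernel_size=8)
out_module = pool(x)                              # shape (4, 10, 1, 1)

# Functional form, same computation
out_functional = F.avg_pool2d(x, kernel_size=8)   # shape (4, 10, 1, 1)

# Drop the two singleton spatial dimensions to get one value per channel
pooled = out_module.flatten(1)                    # shape (4, 10)
```

Note that the result keeps two trailing dimensions of size 1, so it has to be flattened before a softmax or a linear layer.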

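The 10-way fc is there because there are 10 categories. After global average pooling you have one value per channel of the last conv layer, and that number of channels does not have to be 10, so the fc layer maps the pooled vector to the 10 class scores. It’s like you extract features from all the preceding conv layers and feed them into a linear classifier.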
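A rough sketch of such a head, assuming the last conv layer outputs 64 channels on an 8 x 8 grid (the class name `GapHead` and the channel count are assumptions made for illustration, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class GapHead(nn.Module):
    """Global average pooling followed by a 10-way linear classifier."""

    def __init__(self, in_channels=64, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # averages each channel to 1x1
        self.fc = nn.Linear(in_channels, num_classes)  # the "10-way fc"

    def forward(self, x):
        x = self.pool(x).flatten(1)   # (N, C, H, W) -> (N, C)
        return self.fc(x)             # class scores (logits), shape (N, num_classes)

head = GapHead()
features = torch.randn(4, 64, 8, 8)    # hypothetical output of the conv layers
logits = head(features)
probs = torch.softmax(logits, dim=1)   # softmax applied at the very end
```

When training with `nn.CrossEntropyLoss` you would pass `logits` directly, since that loss applies the log-softmax itself.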

Just a note, the SqueezeNet architecture (available in PyTorch model zoo) uses global average pooling. Here’s global average pooling as implemented there:

512 is the number of channels in the feature maps feeding in to this layer, and 13 is the number of rows and columns in the feature maps going in to this layer. You’ll need to change these depending on your network structure.
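A sketch consistent with that description, assuming a 1 x 1 convolution that maps the 512 channels to class scores followed by a 13 x 13 average pool (the layer order and `num_classes = 1000` are assumptions, so check the model zoo source for the exact definition):

```python
import torch
import torch.nn as nn

num_classes = 1000  # assumed number of output categories

classifier = nn.Sequential(
    nn.Conv2d(512, num_classes, kernel_size=1),  # 512 input channels -> one map per class
    nn.ReLU(inplace=True),
    nn.AvgPool2d(13),                            # average over the full 13x13 map
)

x = torch.randn(2, 512, 13, 13)      # hypothetical feature maps entering the head
scores = classifier(x).flatten(1)    # (2, num_classes, 1, 1) -> (2, num_classes)
```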

Another way to do global average pooling for each feature map is to use torch.mean, as suggested by @Soumith_Chintala, but we need to flatten each feature map into a vector first. The following snippet illustrates the idea:

# suppose x is your feature map with size N*C*H*W
x = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)
# now x is of size N*C

Also you can use adaptive_avg_pool2d to achieve global average pooling, just set the output size to (1, 1),

import torch.nn.functional as F
x = F.adaptive_avg_pool2d(x, (1, 1))
# now x is of size N*C*1*1; use x.view(x.size(0), -1) to get N*C
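To check that these approaches agree, here is a small comparison on a made-up tensor (the shapes are arbitrary and only serve as an example):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 10, 8, 8)   # hypothetical N*C*H*W feature maps

# 1) flatten each feature map and take the mean
via_mean = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)

# 2) adaptive pooling to a 1x1 output, then drop the singleton dimensions
via_adaptive = F.adaptive_avg_pool2d(x, (1, 1)).flatten(1)

# 3) fixed-size pooling with the kernel equal to the feature map size
via_fixed = F.avg_pool2d(x, kernel_size=tuple(x.shape[-2:])).flatten(1)

all_agree = (
    torch.allclose(via_mean, via_adaptive, atol=1e-6)
    and torch.allclose(via_mean, via_fixed, atol=1e-6)
)
```

The adaptive version and torch.mean do not need to know the feature map size in advance, which helps if the input resolution can change.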

Think of it like this: suppose I am classifying dogs and in all the training images the dog is on the right. The first fc layer will form very strong weights for that region and not for the rest, because those are the only weights that keep getting updated. If I average pool before the fc instead, I make the fc invariant to position, so it will recognise dogs in any position and not only in the ones it was trained on.
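A toy illustration of that argument, using a made-up single-channel feature map with one activated blob (this only shows that the pooled value does not depend on where the activation sits, it is not a trained model):

```python
import torch

# Hypothetical 8x8 feature map with a 3x3 activated region on the right
right = torch.zeros(1, 1, 8, 8)
right[0, 0, 2:5, 5:8] = 1.0

# The same region moved to the left (columns 5..7 become columns 0..2)
left = torch.roll(right, shifts=-5, dims=3)

# Global average pooling gives the same value for both positions
gap_right = right.mean(dim=(2, 3))
gap_left = left.mean(dim=(2, 3))
same_after_gap = torch.allclose(gap_right, gap_left)

# A flattened vector (what an fc layer would see without pooling) differs
same_when_flattened = torch.equal(right.flatten(1), left.flatten(1))
```

Here `same_after_gap` is True while `same_when_flattened` is False, so an fc layer placed after the pooling sees identical input for both positions.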

Up until this paper, most networks used the convolutional backbone as a feature extractor, and these features were then fed into fully connected layers, followed by an output layer. The problems with fully connected layers are manifold:

They are prone to overfitting, and rely on regularizers like Dropout.

They sit like a black box between the categorical outputs and the spatial features extracted by the convolutional backbone. As a result, the correspondence between the extracted features and the output is not clear. This is especially true during backpropagation.

The idea is to remove an overfitting-prone black box and to replace it with a layer that uses the spatial features extracted by the convolutional layers directly for the outputs. Now, how do you do that?

First, you bring in something that does not learn, i.e. has no parameters.
Second, you want to translate the learned spatial features directly into indicators of the categories.
What fits well with these two points? Pooling. Not only does it not learn, but it also outputs something that directly corresponds to the spatial features learned by the convolutional backbone.

And hence the idea of global average pooling, where you just average each channel to give an output.

Another direct benefit shows up during backpropagation: the category-level errors are passed straight back to the feature maps they originate from in the convolutional backbone, with no learned layer in between, which can make learning more effective.
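A small autograd check of that last point, on a made-up tensor with 10 channels of size 8 x 8 (the channel index 3 is an arbitrary choice): the gradient of one pooled output goes only to its own feature map, spread evenly over all positions.

```python
import torch

# Hypothetical feature maps: 1 sample, 10 channels (one per category), 8x8
x = torch.randn(1, 10, 8, 8, requires_grad=True)

pooled = x.mean(dim=(2, 3))   # global average pooling, shape (1, 10)

# Backpropagate from the output of a single category (index 3 chosen arbitrarily)
pooled[0, 3].backward()
grad = x.grad                 # same shape as x

# Every position of channel 3 receives 1 / (8 * 8); all other channels receive 0
grad_channel_3 = grad[0, 3]
grad_other_channels = torch.cat([grad[0, :3], grad[0, 4:]])
```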