Global average pooling means that you average each feature map separately. In your case if the feature map is of dimension
8 x 8, you average each and obtain a single value. The important part here is that you do the average operation per-channel. You can think of each of the feature maps as the final feature representation per category over which you want to do classification.
To do this you can apply average pooling (nn.AvgPool2d or F.avg_pool2d) with
kernel_size equal to the dimensions of the feature maps (in this case, 8).
The 10-way fc is there because there are 10 categories. It’s like you extract features from all the preceding conv layers and feed them into a linear classifier.
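As a rough sketch of the description above (not code from the thread), the following pools 8 x 8 feature maps per channel and feeds the result to a 10-way linear layer. The batch size of 4 and the 64 channels are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: batch of 4, 64 channels (assumed), 8 x 8 feature maps.
features = torch.randn(4, 64, 8, 8)

# Average each 8 x 8 feature map separately: one value per channel.
pooled = F.avg_pool2d(features, kernel_size=8)   # shape: 4 x 64 x 1 x 1
pooled = pooled.flatten(1)                       # shape: 4 x 64

# 10-way fully connected layer acting as a linear classifier.
fc = nn.Linear(64, 10)
logits = fc(pooled)                              # shape: 4 x 10
print(pooled.shape, logits.shape)
```

The number of input features of the fc layer equals the number of channels, not channels times height times width, because the spatial dimensions have been averaged away.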
Why not use torch.mean to achieve this?
torch.mean works on one dimension instead of all three dimensions.
Just a note, the SqueezeNet architecture (available in PyTorch model zoo) uses global average pooling. Here’s global average pooling as implemented there:
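final_conv = nn.Conv2d(512, self.num_classes, kernel_size=1)
self.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    final_conv,
    nn.ReLU(inplace=True),
    nn.AvgPool2d(13)  # global average pooling over the 13 x 13 feature maps
)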
512 is the number of channels in the feature maps feeding into this layer, and 13 is the number of rows and columns in the feature maps going into the pooling layer. You’ll need to change these depending on your network structure.
x = F.avg_pool2d(x, x.size()[2:]) works fine when x has shape N * C * H * W (with F being torch.nn.functional).
Another way to do global average pooling for each feature map is to use
torch.mean as suggested by @Soumith_Chintala, but we need to flatten each feature map into a vector first. The following snippet illustrates the idea:
# suppose x is your feature map with size N*C*H*W
x = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)
# now x is of size N*C
Also you can use
adaptive_avg_pool2d to achieve global average pooling, just set the output size to (1, 1),
import torch.nn.functional as F
x = F.adaptive_avg_pool2d(x, (1, 1))
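To check that the approaches mentioned so far agree, here is a small sketch comparing them on a random tensor (the sizes are arbitrary assumptions). Note that the pooling functions keep two trailing singleton dimensions, while the torch.mean version returns N*C directly.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 16, 8, 8)  # arbitrary N, C, H, W chosen for illustration

a = F.avg_pool2d(x, x.size()[2:])                        # N x C x 1 x 1
b = F.adaptive_avg_pool2d(x, (1, 1))                     # N x C x 1 x 1
c = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)  # N x C

# Drop the trailing 1 x 1 dimensions so all three results are N x C.
a = a.flatten(1)
b = b.flatten(1)

print(torch.allclose(a, c, atol=1e-6), torch.allclose(b, c, atol=1e-6))
```

If the later layers expect N x C (for example an nn.Linear), the flatten step is needed after either pooling call.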
Did anyone make any benchmarks? I’m guessing
mean was probably the fastest?
I’ve been searching for a detailed explanation of this type of pooling layer, but it seems like a magic function. Do you have any idea about it?
Think of it like this: suppose I am classifying dogs and all the images of dogs have the dog on the right. The first fc layer will form very strong weights with that region and not the rest, as the weights keep fluctuating there. If I average-pool before the fc instead, I make the fc invariant to position, so it will recognise dogs in any position and not only the ones it was trained on.
x = torch.randn(1, 256, 100, 100)
x = torch.nn.AvgPool2d(kernel_size=100, stride=1, padding=0, ceil_mode=False, count_include_pad=True)(x)  # stride must be nonzero
x = x.squeeze()
print(x.shape)  # torch.Size([256])
edit : it will recognise*
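A small sketch of this position argument, using a made-up single-channel feature map rather than a real network: the same activation pattern is placed on the left and then on the right. After global average pooling the two inputs are identical, whereas the flattened maps that a plain fc layer would see are not.

```python
import torch

# Toy 1 x 1 x 4 x 4 feature map with a strong activation on the left (assumed data).
left = torch.zeros(1, 1, 4, 4)
left[0, 0, 1:3, 0:2] = 1.0

# Same pattern shifted to the right-hand side.
right = torch.roll(left, shifts=2, dims=3)

# Global average pooling gives the same value for both positions.
gap_left = left.mean(dim=(2, 3))
gap_right = right.mean(dim=(2, 3))
same_after_gap = torch.equal(gap_left, gap_right)

# A fully connected layer on the flattened map would see different inputs.
same_when_flattened = torch.equal(left.flatten(1), right.flatten(1))

print(same_after_gap, same_when_flattened)
```

This only shows that the pooled value ignores where the activation sits. The convolutional layers before it still have to produce a similar activation for a dog at either location.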
nn.AdaptiveAvgPool2d(1) is another solution
A simple benchmark (run on Google colab):
import torch.nn.functional as F
x = torch.randn((256, 96, 128, 128)).cuda()
%timeit F.avg_pool2d(x, x.size()[2:])
%timeit F.adaptive_avg_pool2d(x, (1, 1))
%timeit torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)
gives the output:
1000 loops, best of 3: 104 ms per loop
10 loops, best of 3: 104 ms per loop
1 loop, best of 3: 208 ms per loop
mean is actually faster.
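One caveat when reading numbers like these: CUDA kernels are launched asynchronously, so a timing loop on a GPU tensor can under-report unless it waits for the device to finish. Below is a sketch of a benchmark that synchronizes explicitly and falls back to the CPU when no GPU is present. The helper name time_op, the repeat count, and the smaller tensor size are assumptions for illustration, not taken from the thread.

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 96, 64, 64, device=device)  # smaller than the tensor above

def time_op(fn, repeats=10):
    """Return the mean wall-clock time in seconds of fn() over `repeats` runs."""
    fn()  # warm-up run, not timed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

timings = {
    "avg_pool2d": time_op(lambda: F.avg_pool2d(x, x.size()[2:])),
    "adaptive_avg_pool2d": time_op(lambda: F.adaptive_avg_pool2d(x, (1, 1))),
    "mean": time_op(lambda: torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Results will vary with the PyTorch version, the hardware, and the tensor shape, so it is worth rerunning this on your own setup before picking one variant for speed.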
Global Average Pooling was introduced in the paper Network in Network.
Up until this paper, most networks used the convolutional backbone as a feature extractor, and these features were then fed into fully connected layers, followed by an output layer. The problems with fully connected layers are manifold:
- They are prone to overfitting, and rely on regularizers like Dropout.
- They sit like a black box between the categorical outputs and the spatial features extracted by the convolutional backbone, so the correspondence between the extracted features and the outputs is not clear. This is especially true during backpropagation.
The idea is to remove an overfitting-prone black box and replace it with a layer that uses the spatial features extracted by the convolutional layers directly for the outputs. Now, how do you do that?
First, you bring in something that does not learn, i.e. has no parameters.
Second, you want to translate the learned spatial features directly into indicators of the categories.
What fits well with these two points? Pooling. Not only does it not learn, but its output also corresponds directly to the spatial features learned by the convolutional backbone.
And hence the idea of global average pooling, where you just average each channel to give an output.
Another direct benefit shows up during backpropagation: the class-level errors are propagated directly back to the feature maps they originate from in the convolutional backbone, which helps learning.
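A minimal sketch of a classification head in this style, assuming a backbone that outputs 512 channels and a 10-class problem (both numbers are placeholders, and the class name GAPHead is made up for this example): a 1 x 1 convolution produces one feature map per class, and global average pooling turns each map into that class's score, with no fully connected layer.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Hypothetical head: 1x1 conv to one map per class, then global average pooling."""

    def __init__(self, in_channels=512, num_classes=10):
        super().__init__()
        self.to_class_maps = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # the pooling itself has no parameters

    def forward(self, x):
        class_maps = self.to_class_maps(x)        # N x num_classes x H x W
        return self.pool(class_maps).flatten(1)   # N x num_classes logits

head = GAPHead()
features = torch.randn(4, 512, 7, 7)  # assumed backbone output
logits = head(features)
pool_params = sum(p.numel() for p in head.pool.parameters())
print(logits.shape, pool_params)
```

Because the pooling is adaptive, the same head accepts feature maps of any spatial size, and the gradient of each logit is spread evenly over the positions of its own class map.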
I keep seeing no reason to not use torch.mean… maybe anyone can explain clearly why it shouldn’t work?
torch.mean works on only one dimension.
You need to use
torch.flatten to flatten the height and width dimensions before using
torch.mean, and use
As of today, the above doesn’t work.
nn.AvgPool2d(kernel_size=x.size()[2:])(x) works when x has shape N * C * H * W.
FYI you can do, e.g.:
feature_maps = torch.rand(16, 512, 7, 7)
feature_vector = feature_maps.mean(dim=(-2, -1)) # or dim=(2, 3)