Global Average Pooling in PyTorch

Why not use torch.mean to achieve this?


torch.mean works on one dimension instead of all three dimensions.

Try this:

F.avg_pool3d(input, kernel_size=input.size()[2:]).view(input.size()[0], -1)
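
A minimal sketch of the one-liner above, assuming a 5-D input laid out as N x C x D x H x W. The tensor shape and the names `x`, `pooled` and `reference` are made up for illustration; the reference value is computed by flattening the three trailing dimensions and averaging over the single resulting dimension.

```python
import torch
import torch.nn.functional as F

# Hypothetical 5-D input: batch of 2, 4 channels, a 3 x 5 x 6 volume per channel
x = torch.randn(2, 4, 3, 5, 6)

# A kernel that covers the whole D x H x W volume leaves one value per channel
pooled = F.avg_pool3d(x, kernel_size=x.size()[2:])  # N x C x 1 x 1 x 1
pooled = pooled.view(x.size(0), -1)                 # N x C

# Reference: flatten the trailing dimensions, then average over that one dimension
reference = x.flatten(start_dim=2).mean(dim=2)

print(pooled.shape)
print(torch.allclose(pooled, reference, atol=1e-6))
```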


Just a note, the SqueezeNet architecture (available in PyTorch model zoo) uses global average pooling. Here’s global average pooling as implemented there:

final_conv = nn.Conv2d(512, self.num_classes, kernel_size=1)
self.classifier = nn.Sequential(
    nn.Dropout(p=0.5), final_conv, nn.ReLU(inplace=True),
    nn.AvgPool2d(13))

512 is the number of channels in the feature maps feeding into this layer, and 13 (the AvgPool2d kernel size) is the number of rows and columns in the feature maps going into the pooling layer. You’ll need to change these depending on your network structure.


x = F.avg_pool2d(x, x.size()[2:]) works fine when x has shape N * C * H * W (with F being torch.nn.functional).


Another way to do global average pooling for each feature map is to use torch.mean, as suggested by @Soumith_Chintala, but we need to flatten each feature map into a vector first. The following snippet illustrates the idea:

# suppose x is your feature map with size N*C*H*W
x = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)
# now x is of size N*C

You can also use adaptive_avg_pool2d to achieve global average pooling; just set the output size to (1, 1):

import torch.nn.functional as F
x = F.adaptive_avg_pool2d(x, (1, 1))
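
As a quick sanity check, the three approaches mentioned so far should agree up to floating-point error. The sketch below assumes a small random N x C x H x W tensor; the shape and the names `a`, `b`, `c` are illustrative only.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature maps: N=8, C=16, H=W=7
x = torch.randn(8, 16, 7, 7)

# 1) Average pooling with a kernel that spans the whole H x W map
a = F.avg_pool2d(x, x.size()[2:]).flatten(1)
# 2) Flatten each feature map into a vector and take its mean
b = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)
# 3) Adaptive average pooling down to a 1 x 1 output
c = F.adaptive_avg_pool2d(x, (1, 1)).flatten(1)

print(a.shape, b.shape, c.shape)
print(torch.allclose(a, b, atol=1e-6), torch.allclose(b, c, atol=1e-6))
```

Note that the two pooling calls return N x C x 1 x 1, so they are flattened here to compare against the N x C result of torch.mean.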

Use nn.AdaptiveAvgPool2d (or nn.AdaptiveMaxPool2d if you want global max pooling instead).


Did anyone make any benchmarks? I’m guessing mean was probably the fastest?

I’ve been searching for a detailed explanation of this type of pooling layer, but it seems like a magic function. Do you have any idea about it?

Think of it like this: suppose I am classifying dogs and in all the images the dog is on the right. The first fc layer will form very strong weights with that region and not with the rest, where the weights just keep fluctuating. If I average pool before the fc layer instead, I make the fc layer invariant to position, so it will recognise dogs in any position and not only the ones it was trained on.

import torch
x = torch.randn(1, 256, 100, 100)
x = torch.nn.AvgPool2d(kernel_size=100, padding=0, ceil_mode=False, count_include_pad=True)(x)  # stride defaults to kernel_size
x = x.squeeze()
print(x.shape) # torch.Size([256])
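
The position argument above (dogs always on the right) can be illustrated with a toy example. This is only a sketch under made-up assumptions: a single 8 x 8 feature map with one activated patch, moved from the right half to the left half with torch.roll. The names `features`, `shifted`, `gap` and `flat` are hypothetical.

```python
import torch

torch.manual_seed(0)

# One image, 3 channels, 8 x 8 maps with a single active patch on the right
features = torch.zeros(1, 3, 8, 8)
features[:, :, 2:4, 5:7] = torch.rand(1, 3, 2, 2) + 0.1  # strictly positive values

# Move the same patch to the left half (circular shift along the width axis)
shifted = torch.roll(features, shifts=-4, dims=3)

# Global average pooling: one number per channel, regardless of patch position
gap = features.mean(dim=(2, 3))
gap_shifted = shifted.mean(dim=(2, 3))

# Plain flattening: what a fully connected layer would see without pooling
flat = features.flatten(1)
flat_shifted = shifted.flatten(1)

print(torch.allclose(gap, gap_shifted))   # pooled vectors match
print(torch.equal(flat, flat_shifted))    # flattened vectors do not
```

The pooled vector is the same for both positions, while the flattened vector changes, so a fully connected layer fed the flattened input would need separate weights for each location.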

nn.AdaptiveAvgPool2d(1) is another solution


A simple benchmark (run on Google Colab):

import torch
import torch.nn.functional as F

x = torch.randn((256, 96, 128, 128)).cuda()

%timeit F.avg_pool2d(x, x.size()[2:])

%timeit F.adaptive_avg_pool2d(x, (1, 1))

%timeit torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)

gives the output:

1000 loops, best of 3: 104 ms per loop
10 loops, best of 3: 104 ms per loop
1 loop, best of 3: 208 ms per loop

Looks like mean is actually faster.
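
One caveat when timing GPU code: CUDA kernels are launched asynchronously, so timing a call without synchronizing can end up measuring launch overhead rather than the computation. Below is a hedged sketch of a more careful comparison. The helper `bench`, the repeat count and the smaller tensor size are all assumptions chosen so it also runs on CPU; actual timings depend on hardware and PyTorch version, so none are shown.

```python
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# Smaller than the tensor in the post so the sketch also runs quickly on CPU
x = torch.randn(16, 32, 64, 64, device=device)


def bench(fn, repeats=10):
    """Average wall-clock seconds per call, synchronizing around the timed loop."""
    fn()  # warm-up call, not timed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats


candidates = {
    "avg_pool2d": lambda: F.avg_pool2d(x, x.size()[2:]),
    "adaptive_avg_pool2d": lambda: F.adaptive_avg_pool2d(x, (1, 1)),
    "mean": lambda: torch.mean(x.view(x.size(0), x.size(1), -1), dim=2),
}

timings = {name: bench(fn) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```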


Global Average Pooling was introduced in the paper Network in Network.

Up until this paper, most networks used convolutional backbones as feature extractors, and these features were then fed into fully connected layers, followed by an output layer. The problems with fully connected layers are manifold:

  • They are prone to overfitting, and rely on regularizers like Dropout.
  • They sit like a black box between the categorical outputs and the spatial features extracted by the convolutional backbone, so the correspondence between the extracted features and the outputs is not clear. This is especially true during backpropagation.

The idea is to remove an overfitting-prone black box and replace it with a layer that uses the spatial features extracted by the convolutional layers directly for the outputs. Now, how do you do that?

First, you bring in something that does not learn, i.e. has no parameters.
Second, you want the learned spatial features to act as direct indicators of the categorical signals.
What fits both of these points? Pooling. Not only does it not learn, but it also outputs something that directly corresponds to the spatial features learned by the convolutional backbone.

And hence the idea of global average pooling, where you just average each channel to give an output.

Another direct benefit shows up during backpropagation: the class-level errors flow straight back to the feature maps they originated from in the convolutional backbone, with no fully connected layer mixing them in between, which helps learning.
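
To make the contrast concrete, here is a rough sketch comparing a fully connected head with a global-average-pooling head. The sizes (512 channels, 7 x 7 maps, 10 classes) and the names `fc_head`, `gap_head` and `count_params` are assumptions for illustration, not taken from the paper. In the GAP head the 1x1 convolution still has weights; the pooling step itself has none.

```python
import torch
import torch.nn as nn

num_classes = 10                      # hypothetical number of classes
features = torch.randn(4, 512, 7, 7)  # hypothetical backbone output, N x C x H x W

# Fully connected head: every spatial position of every channel gets its own weights
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, num_classes),
)

# GAP head: a 1x1 conv yields one feature map per class, then each map is averaged
gap_head = nn.Sequential(
    nn.Conv2d(512, num_classes, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),  # parameter-free global average pooling
    nn.Flatten(),             # N x num_classes x 1 x 1 -> N x num_classes
)


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


fc_out = fc_head(features)
gap_out = gap_head(features)
print(fc_out.shape, count_params(fc_head))
print(gap_out.shape, count_params(gap_head))
```

Both heads produce one score per class, but with these sizes the fully connected head carries roughly fifty times as many parameters as the GAP head.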


I still see no reason not to use torch.mean… maybe someone can explain clearly why it shouldn’t work?


torch.mean with an integer dim works on only one dimension.
You need to use torch.flatten to flatten the height and width dimensions before using torch.mean, and use torch.reshape afterward.

As of today, the above doesn’t work. nn.AvgPool2d(x.shape[2:])(x) works when x.shape = N * C * H * W.

FYI you can do, e.g.:

import torch
feature_maps = torch.rand(16, 512, 7, 7)
feature_vector = feature_maps.mean(dim=(-2, -1))  # or dim=(2, 3)
print(feature_vector.shape)  # torch.Size([16, 512])

You can also call torch.mean() n times, once for each of the n dimensions you need to reduce.
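
A small sketch of that idea, assuming the same 16 x 512 x 7 x 7 shape as in the previous post. Averaging over the width and then over the height gives the same result as a single mean over both dimensions, because every row has the same number of elements. The names `step`, `joint` and `kept` are illustrative.

```python
import torch

x = torch.rand(16, 512, 7, 7)  # N x C x H x W

# Reduce one dimension at a time: first W (last dim), then H (the new last dim)
step = x.mean(dim=-1).mean(dim=-1)   # N x C

# Single call over both spatial dimensions, for comparison
joint = x.mean(dim=(-2, -1))         # N x C

print(step.shape)
print(torch.allclose(step, joint, atol=1e-6))

# keepdim=True preserves the N x C x 1 x 1 layout that pooling layers return
kept = x.mean(dim=-1, keepdim=True).mean(dim=-2, keepdim=True)
print(kept.shape)
```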