So I found this piece of code in the implementation of the paper “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition” (it’s supposed to be a 14-layer CNN):
    x = self.conv_block6(x, pool_size=(1, 1), pool_type='avg')  # output of the last conv layer
    x = F.dropout(x, p=0.2, training=self.training)  # Dropout, global pooling is supposed to come after this (according to the model architecture at least)
    x = torch.mean(x, dim=3)  # first step of Global avg pooling?
    (x1, _) = torch.max(x, dim=2)  # Global max pooling?
    x2 = torch.mean(x, dim=2)  # next step of Global avg pooling?
    x = x1 + x2  # Combining both results
    # adding fc layers and sigmoid output
    x = F.dropout(x, p=0.5, training=self.training)
    x = F.relu_(self.fc1(x))
    embedding = F.dropout(x, p=0.5, training=self.training)
    clipwise_output = torch.sigmoid(self.fc_out(x))
The paper claims that “to combine their advantages, we sum the averaged and maximized vectors”. What I don’t understand is why. Why is it better to add these two vectors instead of simply using global average pooling?
I’ve also seen people combine regular avg_pool2d with max_pool2d like this:
    x1 = F.avg_pool2d(x, kernel_size=pool_size)
    x2 = F.max_pool2d(x, kernel_size=pool_size)
    x = x1 + x2
Same question here: I’m not sure why the addition operation in particular helps with the results. Why not use another operation, or average the two results instead of adding them?
Sorry if this is a little bit obvious, but I can’t quite grasp the idea behind it yet.
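To make the "add vs. average" part of the question concrete, here is a minimal sketch. The tensor shape and `pool_size` are made up for illustration and are not taken from the PANNs code. It only shows that summing and averaging the two pooled tensors differ by a constant factor of 2:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical feature map, assumed layout: (batch, channels, time, freq)
x = torch.randn(2, 4, 8, 8)
pool_size = (2, 2)  # illustrative value only

x_avg = F.avg_pool2d(x, kernel_size=pool_size)
x_max = F.max_pool2d(x, kernel_size=pool_size)

summed = x_avg + x_max          # what the snippet above does
averaged = (x_avg + x_max) / 2  # the alternative asked about

# Both have the same shape, and they differ only by a constant scale
print(summed.shape)
same_up_to_scale = torch.allclose(summed, 2 * averaged)
print(same_up_to_scale)
```

Whether that scale matters in practice presumably depends on what follows the pooling, so this sketch does not by itself explain why the sum was chosen.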
PS: I just realized that x1 is obtained by taking the max after the mean of x over dim 3. I thought global max pooling was supposed to be something like torch.max(x, dim=[2,3]).
Is there something else I’m missing here?
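A small sketch of the difference described in the PS, under the assumption that the feature map is laid out as (batch, channels, time, freq), so that dim 3 is frequency and dim 2 is time. The shape is hypothetical. `torch.max` reduces over a single integer `dim`, so the max over both dims is written with `torch.amax`, which accepts a tuple of dims:

```python
import torch

torch.manual_seed(0)

# Hypothetical feature map, assumed layout: (batch, channels, time, freq)
x = torch.randn(2, 4, 6, 5)

# Order used in the snippet: average over dim 3 first, then reduce dim 2
x_mean3 = torch.mean(x, dim=3)       # shape (2, 4, 6)
x1, _ = torch.max(x_mean3, dim=2)    # shape (2, 4): max over dim 2 of the dim-3 means
x2 = torch.mean(x_mean3, dim=2)      # shape (2, 4): same as the mean over both dims

# Reductions over both dims at once, for comparison
global_max = torch.amax(x, dim=(2, 3))   # shape (2, 4)
global_avg = torch.mean(x, dim=(2, 3))   # shape (2, 4)

mean_matches = torch.allclose(x2, global_avg, atol=1e-6)
max_never_exceeds = bool((x1 <= global_max).all())
print(mean_matches, max_never_exceeds)
```

So the two-step mean is the usual global average, while x1 is a max over one axis of values already averaged over the other, which is generally smaller than the max over both dims.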
Thanks for any help!!