I am implementing SE-ResNet for a binary classification problem. I noticed that, in the description of the SE layers, linear layers were used to compute the attention map. However, in the code, 1x1 conv layers are used instead.
Is there a specific reason for this? Are 1x1 convs more stable than linear layers? Or is it that both can be used interchangeably and it does not matter which one of them is used?
It would be great if you could link the code in the question as well.
To answer your question: the short answer is that, in the context of an SE layer, a 1x1 conv and a linear layer do the same job.
Consider that global average pooling of the 2D feature maps gives an output of dimension B x P x 1 x 1, where B = batch size and P = number of channels. In the SE layer, the next step is to map this P-dimensional vector to a Q-dimensional vector (P and Q can be any numbers).
This can be achieved either by
using a linear layer: nn.Linear(in_features=P, out_features=Q)
or
by using a 1x1 conv layer: nn.Conv2d(in_channels=P, out_channels=Q, kernel_size=1)
In the case of the linear layer, the global average pooling output (B x P x 1 x 1) needs to be squeezed to the correct 2D shape (B x P) before it is passed through the linear layer, and the result then has to be converted back to a 4D tensor. The 1x1 conv layer, in contrast, works on the 4D tensor directly, without the intermediate squeeze operation.
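Here is a minimal sketch (not the exact SE-ResNet code from your question; the values of B, P, Q are arbitrary examples) showing that the two options compute the same mapping once they share the same weights:

```python
import torch
import torch.nn as nn

B, P, Q = 4, 64, 16                      # batch size, input channels, reduced channels
pooled = torch.randn(B, P, 1, 1)         # pretend output of global average pooling

linear = nn.Linear(in_features=P, out_features=Q)
conv1x1 = nn.Conv2d(in_channels=P, out_channels=Q, kernel_size=1)

# Copy the linear weights/bias into the conv layer so both use identical parameters.
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(Q, P, 1, 1))
    conv1x1.bias.copy_(linear.bias)

# Linear path: squeeze to (B, P), apply the layer, reshape back to (B, Q, 1, 1).
out_linear = linear(pooled.flatten(1)).view(B, Q, 1, 1)

# Conv path: works on the 4D tensor directly.
out_conv = conv1x1(pooled)

print(torch.allclose(out_linear, out_conv, atol=1e-6))  # True
```

So the choice mostly comes down to whether you want to keep the tensor 4D throughout the block or are happy to squeeze/unsqueeze around the linear layers.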
Hi, I apologize for my delayed response. Thank you for your replies @InnovArul and @J_Johnson, these gave me better insight into the questions I asked.