Why are 1x1 convs used in SE layers instead of linear layers?

I am implementing SE-ResNet for a binary classification problem. I noticed that, in the description of the SE layers, linear layers are used to compute the attention map; in the code, however, 1x1 conv layers are used instead.

Is there a specific reason for this? Are 1x1 convs more stable than linear layers? Or can both be used interchangeably, so that it does not matter which one is used?

Thank you!

It would be great if you could link the code in the question as well.
To answer your question, the short answer is that, in the context of an SE layer, a 1x1 conv and a linear layer do the same job.

Consider that global average pooling of the 2D features gives an output of dimension B x P x 1 x 1, where B = batch size and P = number of channels. In the SE layer, the next step is to convert this P-dimensional vector to a Q-dimensional vector (P and Q can be any numbers).

This can be achieved either by,

  • using a linear layer
    nn.Linear(in_features=P, out_features=Q)

(or)

  • by using a 1x1 conv layer
    nn.Conv2d(in_channels=P, out_channels=Q, kernel_size=1)

In the case of the linear layer, the global average pooling output (B x P x 1 x 1) needs to be squeezed to the correct dimension (2D = B x P) before it can pass through the linear layer, and then reshaped back to a 4D tensor. The 1x1 conv layer, on the other hand, works on the 4D tensor directly, without the intermediate squeeze operation.
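To make the shape handling concrete, here is a minimal sketch (assuming PyTorch, with arbitrary values for B, P, and Q) of the extra flatten/reshape that the linear path needs compared to the 1x1 conv path:

    import torch
    import torch.nn as nn

    B, P, Q = 8, 64, 16                  # batch size, input channels, reduced channels (arbitrary)
    pooled = torch.randn(B, P, 1, 1)     # stand-in for the global average pooling output

    # Linear path: flatten to (B, P), apply the layer, reshape back to (B, Q, 1, 1)
    fc = nn.Linear(in_features=P, out_features=Q)
    out_fc = fc(pooled.flatten(1)).view(B, Q, 1, 1)

    # 1x1 conv path: works on the 4D tensor directly
    conv = nn.Conv2d(in_channels=P, out_channels=Q, kernel_size=1)
    out_conv = conv(pooled)

    print(out_fc.shape, out_conv.shape)  # both torch.Size([8, 16, 1, 1])

Both paths produce an output of the same shape; the only difference is the bookkeeping around the tensor dimensions.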

Mathematically, a Conv2d layer with a 1x1 kernel is identical to a Linear operation. Please see here for a code demonstration:
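In the same spirit (not the linked demo; just a minimal sketch assuming PyTorch, with arbitrary layer sizes), copying a Linear layer's weights into a 1x1 Conv2d yields numerically identical outputs:

    import torch
    import torch.nn as nn

    P, Q = 64, 16
    fc = nn.Linear(P, Q)
    conv = nn.Conv2d(P, Q, kernel_size=1)

    # Reuse the linear layer's parameters: weight (Q, P) -> (Q, P, 1, 1), bias is shared as-is
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(Q, P, 1, 1))
        conv.bias.copy_(fc.bias)

    x = torch.randn(8, P, 1, 1)
    out_fc = fc(x.flatten(1))        # shape (8, Q)
    out_conv = conv(x).flatten(1)    # shape (8, Q)
    print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True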

Hi, I apologize for my delayed response. Thank you for your replies @InnovArul and @J_Johnson, these gave me better insight into the questions I asked.