Hey guys, when training models for an image classification task, I tried replacing the pretrained model's last fc layer with an `nn.Linear` layer and an `nn.Conv2d` layer (with `kernel_size=1` so it acts as an fc layer), respectively, and found that the two models perform differently. Specifically, the conv2d one always performs better on my task. I wonder if this is because of different initialization methods for the two layers, and what the default initialization method is for a conv2d layer and a linear layer in PyTorch. Thank you in advance.

This is the initialization for linear:

And this is the initialization for conv:
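As a quick check (assuming a recent PyTorch release), both layers default to the same `kaiming_uniform_(a=sqrt(5))` scheme in their `reset_parameters`, which bounds the weights by `1/sqrt(fan_in)`; for a `1x1` conv the fan-in equals the linear layer's `in_features`, so the sampling ranges match:

```python
import math

import torch
import torch.nn as nn

lin = nn.Linear(512, 10)
conv = nn.Conv2d(512, 10, kernel_size=1)

# kaiming_uniform_ with a=sqrt(5) samples weights from
# U(-1/sqrt(fan_in), 1/sqrt(fan_in)); fan_in is 512 for both layers here.
bound = 1.0 / math.sqrt(512)
print(lin.weight.abs().max().item() <= bound)   # True
print(conv.weight.abs().max().item() <= bound)  # True
```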

Thank you, richard. It seems they are initialized the same way when acting as an fc layer. But I'm more confused about why one model performs better than the other.

Could you give some information regarding the input and output shape of the linear and conv layer?

The pretrained model’s feature map (after an avg-pooling layer) is of shape **(bs, 512, 1, 1)**,

**so for nn.Conv2d layer,** it should be

`self.fc = nn.Conv2d(512, num_classes, kernel_size=1, stride=1, padding=0, bias=True)`

In the forward pass:

```
# input x is the feature map
x = self.fc(x)
out = x.view(x.size(0), -1)
```

**for nn.Linear,**

`self.fc = nn.Linear(512, num_classes, bias=True)`

In the forward pass:

`x = self.fc(x.view(x.size(0), -1))`
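A quick sanity check that the two heads above produce the same output shape (`num_classes=10` is a placeholder here):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512, 1, 1)  # feature map after avg pooling, (bs, 512, 1, 1)

fc_conv = nn.Conv2d(512, 10, kernel_size=1, stride=1, padding=0, bias=True)
fc_lin = nn.Linear(512, 10, bias=True)

out_conv = fc_conv(x).view(x.size(0), -1)   # conv head, then flatten
out_lin = fc_lin(x.view(x.size(0), -1))     # flatten, then linear head
print(out_conv.shape, out_lin.shape)  # both torch.Size([4, 10])
```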

This looks good. Is the “conv model” performing better in every run?

You could use different seeds and check if it’s a random issue or systematic.
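A minimal seeding sketch for such a comparison (the training loop itself is left as a comment, since it depends on your task):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed all relevant RNGs so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU (and CUDA, if available) generators

for seed in (0, 1, 2):
    set_seed(seed)
    # train and evaluate both head variants here and record the metrics
```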

At the moment I don’t see any reason the conv layer should perform better than the linear layer.

Yeah, I also think it’s a coincidence. Anyway, I’ll run several further experiments and double-check my code.

Hi, is the initialization for conv2d xavier initialization?

Thanks.

Hi, have you solved this question?

It seems kaiming init is used for `ConvNd` now:
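If you want to test the initialization hypothesis directly, you can also override the default (e.g. with Xavier init) instead of relying on it; a minimal sketch, with `num_classes` as a placeholder:

```python
import torch.nn as nn

num_classes = 10  # placeholder for your task
conv_fc = nn.Conv2d(512, num_classes, kernel_size=1)

# replace the default kaiming-uniform init with Xavier uniform
nn.init.xavier_uniform_(conv_fc.weight)
nn.init.zeros_(conv_fc.bias)
```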

I did not understand the initial assumption that a Conv2d with a 1x1 filter would act as a fully connected layer. I thought these were very different. Thinking of a single image: an fc layer would treat the image as a 1d vector and hence assign a different weight to each pixel, while a 1x1 convolution would apply the same filter across the whole image, corresponding to multiplying all pixels by the same value. Isn’t that the way things work?

Convolution and linear layers look quite different if you compare how the parameters are used in these operations, but note that you could use a matrix multiplication in a conv layer by unfolding the input and transforming the kernels appropriately (the input transformation is often called `im2col` in the literature).
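The `im2col` idea can be sketched with `torch.nn.functional.unfold`, which extracts all kernel-sized patches as columns so the convolution becomes a single matrix multiplication:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8)
conv = nn.Conv2d(3, 6, kernel_size=3, padding=1)

# im2col: extract all 3x3 patches as columns -> (N, 3*3*3, 64)
cols = F.unfold(x, kernel_size=3, padding=1)
w = conv.weight.view(conv.out_channels, -1)  # flatten kernels -> (6, 27)

# batched matmul plus bias reproduces the convolution
out_mm = w @ cols + conv.bias.view(1, -1, 1)  # (2, 6, 64)
out_mm = out_mm.view(2, 6, 8, 8)

print(torch.allclose(conv(x), out_mm, atol=1e-5))  # True
```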

For a `1x1` kernel, you could also view and permute the input to get the same results, as seen here:

```
x = torch.randn(2, 3, 24, 24)
conv = nn.Conv2d(3, 6, 1)
out_conv = conv(x)

lin = nn.Linear(3, 6)
with torch.no_grad():
    lin.weight.copy_(conv.weight.squeeze())  # remove the 1x1 spatial dims
    lin.bias.copy_(conv.bias)

# permute x so the linear layer is applied at each spatial location
x = x.view(x.size(0), x.size(1), -1)
x = x.permute(0, 2, 1)
out_lin = lin(x)

# permute back to be able to compare results
out_lin = out_lin.permute(0, 2, 1)
out_lin = out_lin.view(out_lin.size(0), out_lin.size(1), 24, 24)

# compare
print((out_conv - out_lin).abs().max())
# > tensor(4.7684e-07, grad_fn=<MaxBackward1>)
```

In that sense, these two layers are quite similar to each other.