What is the default initialization of a conv2d layer and linear layer?

insomnia250 · April 6, 2018, 3:07pm

Hey guys, when training models for an image classification task, I tried replacing the pretrained model’s last fc layer with an nn.Linear layer and with an nn.Conv2d layer (setting kernel_size=1 so that it acts as an fc layer), and found that the two models perform differently. Specifically, the conv2d one always performs better on my task. I wonder whether this is due to different initialization methods for the two layers, and what the default initialization method is for a conv2d layer and a linear layer in PyTorch. Thank you in advance.

richard · April 6, 2018, 3:11pm

This is the initialization for linear:

github.com

pytorch/pytorch/blob/master/torch/nn/modules/linear.py#L48-L52


def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)

And this is the initialization for conv:

github.com

pytorch/pytorch/blob/08891b0a4e08e2c642deac2042a02238a4d34c67/torch/nn/modules/conv.py#L40-L47


def reset_parameters(self):
    n = self.in_channels
    for k in self.kernel_size:
        n *= k
    stdv = 1. / math.sqrt(n)
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)

insomnia250 · April 6, 2018, 3:41pm

Thank you richard. It seems they are initialized the same way when the conv layer acts as an fc layer. But now I’m more confused about why one model performs better than the other.
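A quick way to check this against the two `reset_parameters` snippets quoted above is to compute the bound each one would use for this case. This is a minimal sketch in plain Python; the helper names `linear_bound` and `conv_bound` are made up for illustration and are not part of PyTorch.

```python
import math

def linear_bound(in_features):
    # Mirrors the quoted Linear.reset_parameters: stdv = 1 / sqrt(weight.size(1)),
    # where the weight has shape (out_features, in_features).
    return 1.0 / math.sqrt(in_features)

def conv_bound(in_channels, kernel_size):
    # Mirrors the quoted conv reset_parameters: n = in_channels * prod(kernel_size).
    n = in_channels
    for k in kernel_size:
        n *= k
    return 1.0 / math.sqrt(n)

lin = linear_bound(512)
conv_1x1 = conv_bound(512, (1, 1))
conv_3x3 = conv_bound(512, (3, 3))  # for contrast: a larger kernel shrinks the bound

print(lin, conv_1x1, conv_3x3)
```

With 512 input features and a 1x1 kernel, both layers draw weights and biases from the same uniform range, so under the quoted code the initialization alone would not explain a systematic gap.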

ptrblck · April 6, 2018, 4:39pm

Could you give some information regarding the input and output shape of the linear and conv layer?

insomnia250 · April 7, 2018, 1:30am

The pretrained model’s feature map (after an average pooling layer) has shape (bs, 512, 1, 1),
so for the nn.Conv2d layer it is
self.fc = nn.Conv2d(512, num_classes, kernel_size=1, stride=1, padding=0, bias=True)
and in forward,

# input x is the feature map
x = self.fc(x)
out = x.view(x.size(0), -1)

For nn.Linear it is
self.fc = nn.Linear(512, num_classes, bias=True)
and in forward,
x = self.fc(x.view(x.size(0), -1))

ptrblck · April 7, 2018, 11:18am

This looks good. Is the “conv model” performing better in every run?
You could use different seeds and check if it’s a random issue or systematic.

At the moment I don’t see any reason the conv layer should perform better than the linear layer.
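The seed check suggested above could be organized along these lines. This is only a sketch: `train_and_eval` stands for whatever training and evaluation routine you already have (it is not a PyTorch function), and in a real run it would seed the random number generators (for example with `torch.manual_seed(seed)`) before building the model. The stand-in used here returns a meaningless dummy number so that the snippet runs on its own.

```python
import statistics

def summarize_runs(train_and_eval, seeds):
    """Run a training routine once per seed and summarize the scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

def dummy_train_and_eval(seed):
    # Placeholder only: replace with real training that seeds the RNGs and
    # returns a validation metric. The value produced here is not a result.
    return 0.5 + 0.01 * (seed % 3)

summary = summarize_runs(dummy_train_and_eval, seeds=[0, 1, 2, 3, 4])
print(summary["mean"], summary["stdev"])
```

Running this once for the conv head and once for the linear head gives two means and two spreads. If the gap between the means is small compared with the spread across seeds, the difference is probably noise.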

insomnia250 · April 7, 2018, 1:26pm

Yeah, I also think it’s a coincidence. Anyway, I’ll run several more experiments and double-check my code.

Ariel_biubiu · May 2, 2018, 3:32am

Hi, is the initialization for conv2d xavier initialization?
Thanks.

ayumiymk · December 21, 2018, 3:49am

Hi, have you solved this question?

zyc · November 2, 2020, 2:26pm

It seems Kaiming initialization is used for _ConvNd now:

github.com

pytorch/pytorch/blob/main/torch/nn/modules/conv.py#L110-L115


      
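To my knowledge, the current `reset_parameters` in `conv.py` calls `init.kaiming_uniform_(self.weight, a=math.sqrt(5))`; treat that as an assumption to verify against the linked file. Under it, the documented Kaiming uniform formulas (gain = sqrt(2 / (1 + a^2)) for leaky ReLU, bound = gain * sqrt(3 / fan_in)) reduce to the same 1 / sqrt(fan_in) bound as the older code quoted earlier in this thread. The helper name `kaiming_uniform_bound` is made up for this sketch.

```python
import math

def kaiming_uniform_bound(fan_in, a):
    # Leaky ReLU gain as documented for torch.nn.init.calculate_gain.
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    # kaiming_uniform_ samples from U(-bound, bound) with bound = gain * sqrt(3 / fan_in).
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 512 * 1 * 1  # in_channels * kernel_h * kernel_w for the 1x1 conv in this thread
new_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))  # assumed current default
old_bound = 1.0 / math.sqrt(fan_in)                        # bound from the 2018 snippet

print(new_bound, old_bound)
```

If that assumption holds, the switch to a Kaiming-style call changed how the bound is expressed, not the range the weights are drawn from.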

julianolm · February 26, 2021, 11:50am

I did not understand the initial assumption that a Conv2d with a 1x1 filter would act as a fully connected layer. I thought these were very different. Thinking of a single image: an fc layer would treat the image as a 1d vector and hence assign a different weight to each pixel, while a 1x1 convolution would use the same filter across the whole image, which corresponds to multiplying all of the pixels by the same value. Is this not the way things work?

ptrblck · February 27, 2021, 6:42am

Convolution and linear layers look quite different if you compare how the parameters are used in these operations, but note that you could use a matrix multiplication in a conv layer by unfolding the input and transforming the kernels appropriately (the input transformation is often called im2col in the literature).
For a 1x1 kernel, you could also view and permute the input to get the same results, as seen here:

import torch
import torch.nn as nn

x = torch.randn(2, 3, 24, 24)
conv = nn.Conv2d(3, 6, 1)
out_conv = conv(x)

lin = nn.Linear(3, 6)
with torch.no_grad():
    lin.weight.copy_(conv.weight.squeeze()) # remove spatial size
    lin.bias.copy_(conv.bias)
    
# permute x so that linear layer is executed on spatial dimension repeatedly
x = x.view(x.size(0), x.size(1), -1)
x = x.permute(0, 2, 1)
out_lin = lin(x)

# permute back to be able to compare results
out_lin = out_lin.permute(0, 2, 1)
out_lin = out_lin.view(out_lin.size(0), out_lin.size(1), 24, 24)

# compare
print((out_conv - out_lin).abs().max())
> tensor(4.7684e-07, grad_fn=<MaxBackward1>)

In particular these two layers are quite similar to each other.
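The im2col idea mentioned at the start of this post can be sketched without any framework. This is a deliberately small, assumption-laden version: one input channel, one filter, stride 1, no padding, nested Python lists instead of tensors, and function names invented for the example. In PyTorch the unfolding step is available as `torch.nn.functional.unfold`.

```python
def im2col(image, k):
    """Collect every k x k patch of a 2D list as one flat row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]

def conv2d_direct(image, kernel):
    """Sliding-window cross-correlation, which is what conv layers compute."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(w - k + 1)
        ]
        for i in range(h - k + 1)
    ]

def conv2d_im2col(image, kernel):
    """Same result as a matrix-vector product over the unfolded patches."""
    k = len(kernel)
    out_w = len(image[0]) - k + 1
    flat_kernel = [v for row in kernel for v in row]
    flat_out = [sum(p * q for p, q in zip(patch, flat_kernel))
                for patch in im2col(image, k)]
    # Fold the flat result back into rows of the output feature map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]

direct = conv2d_direct(image, kernel)
unfolded = conv2d_im2col(image, kernel)
print(direct)
print(unfolded)
```

Each row produced by `im2col` plays the role of one input vector to a linear layer whose weight is the flattened kernel, which is why the two computations agree.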

aoot · January 3, 2025, 9:59pm

According to the last sentence on page 7 of Liu et al. (2021), all PyTorch networks default to Xavier uniform initialization.

Liu, S., Li, X., Zhai, Y., You, C., Zhu, Z., Fernandez-Granda, C., & Qu, Q. (2021). Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training. Advances in Neural Information Processing Systems, 34, 28919–28928.
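For comparison with the `reset_parameters` snippets quoted earlier in this thread: Xavier (Glorot) uniform initialization with gain 1 draws from U(-b, b) with b = sqrt(6 / (fan_in + fan_out)), whereas the quoted code uses 1 / sqrt(fan_in). The sketch below, with helper names made up for the example, suggests that the two bounds generally differ and coincide only when fan_out = 5 * fan_in.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    # Bound documented for torch.nn.init.xavier_uniform_.
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def quoted_default_bound(fan_in):
    # Bound used by the reset_parameters snippets quoted in this thread.
    return 1.0 / math.sqrt(fan_in)

# (fan_in, fan_out) pairs; 10 and 2560 outputs are illustrative values only.
for fan_in, fan_out in [(512, 10), (512, 512), (512, 2560)]:
    print(fan_in, fan_out,
          xavier_uniform_bound(fan_in, fan_out),
          quoted_default_bound(fan_in))
```

So it is worth checking the source of the specific layer rather than assuming one scheme for every PyTorch module.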