Default weight initialisation for Conv layers (including SELU)

Firstly, apologies if these are silly questions!

1. What is the default initialisation used for Conv layers, and does it depend on the nonlinearity selected for after the layer?
2. When using a SELU nonlinearity, does the network automatically initialise the weights using LeCun Normal initialisation? If not, how could I implement LeCun Normal weight initialisation manually?

A bit of context:

Reading through various blog posts and questions from the past few years, for (1) I managed to find two opposing opinions: either that PyTorch automatically initialises all weights with LeCun Normal, or that PyTorch initialises weights based on the non-linearity used after the Conv layer (Xavier for Tanh, Kaiming He for ReLU and its variants). However, when I check the source code (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/conv.py), it appears that the default weight initialisation is Kaiming:

    def reset_parameters(self) -> None:
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

In this case, my understanding is that Kaiming is the default weight initialisation for Conv layers, regardless of the following nonlinearity? Thus, most of the posts I previously read are outdated.

Then, in order to implement LeCun Normal initialisation, do I need to override reset_parameters in my own code so that it replaces the default PyTorch behaviour?
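For reference, this is roughly what I have in mind: a manual LeCun Normal initialisation draws from N(0, 1/fan_in). A minimal sketch, where the helper name lecun_normal_ is my own (it is not a torch.nn.init function):

```python
import math
import torch
import torch.nn as nn

def lecun_normal_(tensor):
    """Fill tensor with N(0, 1/fan_in) values (LeCun Normal).
    Hypothetical helper, not part of torch.nn.init."""
    fan_in, _ = nn.init._calculate_fan_in_and_fan_out(tensor)
    with torch.no_grad():
        return tensor.normal_(0, math.sqrt(1.0 / fan_in))

def init_lecun(m):
    # Apply LeCun Normal to conv/linear weights; zero the biases.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        lecun_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: model.apply(init_lecun)
```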


Hi,

For the first question, please see these posts:

  1. Clarity on default initialization in pytorch
  2. CNN default initialization understanding

I have explained the magic number math.sqrt(5) there, so you can also get the idea behind the relation between the non-linearity and the init method. Actually, the default initialisation is uniform.

Also, see this reply in the GitHub thread about it: https://github.com/pytorch/pytorch/issues/15314#issuecomment-477448573
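To illustrate that magic number with a quick arithmetic check (plain Python, no torch needed): kaiming_uniform_ uses the bound gain * sqrt(3 / fan_in), and the leaky-ReLU gain is sqrt(2 / (1 + a^2)). With a = sqrt(5), the bound collapses to 1/sqrt(fan_in), i.e. the classic uniform default. The helper below is just my own sketch of that computation:

```python
import math

def kaiming_uniform_bound(fan_in, a):
    """Bound of the Kaiming uniform init: gain * sqrt(3 / fan_in),
    with the leaky-ReLU gain sqrt(2 / (1 + a**2))."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 576  # e.g. a 3x3 conv with 64 input channels
bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))
assert math.isclose(bound, 1.0 / math.sqrt(fan_in))
```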

About the second question, you can reinitialise the weights after they have been initialised with the default values. To do so, you can create your own init function, similar to the available cases in the torch.nn.init package, and use code similar to the following snippet:

import torch.nn as nn

def init_weights(m):
    """
    Initialize weights of layers using Kaiming Normal (He et al.); pass this as
    the argument of the "apply" function of "nn.Module".
    :param m: Layer to initialize
    :return: None
    """
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out')
        if m.bias is not None:  # conv layers may be created with bias=False
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

model.apply(init_weights)

Bests


Thanks @Nikronic! This makes things slightly clearer to me. However, I still have some questions, if that’s okay?

I understand the use of math.sqrt(5) and how this ties in with the nonlinearity, IF the nonlinearity is ReLU or LeakyReLU. In my case, I am using a PReLU non-linearity for now. This is similar to LeakyReLU, and the Kaiming initialisation was created with it in mind. Given this, would math.sqrt(5) still be a good choice for the a parameter of the Kaiming initialisation?

Also, thank you for the advice on reinitialisation!

Great. If you are using anything other than LeakyReLU, you need to derive the proper value based on the gain of that particular activation function.


I understand. Do you have any suggestions on how this can be done, or any sources I can consult? I had a look at torch.nn.init.calculate_gain, but ‘prelu’ is not a supported nonlinearity.

Yes, you need to find the gain yourself in this case. Actually, I am not familiar with calculating the gain, so I cannot help with that. Let me know if you find anything.


Okay - this is a continuation of the message stream I had with @Nikronic, and is my solution to calculating the gain in order to properly use PReLU nonlinearity. I have not yet implemented this myself, but it’s what makes sense to me after reading the derivation in the original paper: https://arxiv.org/pdf/1502.01852.pdf. I’m writing this both for the community and my own later use :slight_smile:

So, we start from the variance condition in the paper (their Eq. 10), where a is the PReLU slope and n_l the fan of layer l:

    (1/2) * (1 + a^2) * n_l * Var[w_l] = 1

If we rearrange this, we obtain that the standard deviation (= sqrt(Var)) is given by the following, which matches the values from https://pytorch.org/docs/stable/nn.init.html. The gain is calculated identically to the LeakyReLU one:

    std = gain / sqrt(fan_mode),    gain = sqrt(2 / (1 + a^2))
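In other words, the PReLU std can be computed with the same formula as the LeakyReLU one, just plugging in PReLU's initial slope. A small sketch of this (the helper name is my own, not a PyTorch function):

```python
import math

def kaiming_normal_std(fan, negative_slope):
    """std = gain / sqrt(fan), with gain = sqrt(2 / (1 + a**2)),
    i.e. std = sqrt(2 / ((1 + a**2) * fan))."""
    return math.sqrt(2.0 / ((1.0 + negative_slope ** 2) * fan))

# PyTorch's nn.PReLU starts with a slope of 0.25 by default
std = kaiming_normal_std(fan=576, negative_slope=0.25)
```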

The equation derived in the original paper is the one for the normal distribution, not the uniform one, i.e. torch.nn.init.kaiming_normal_. So, in theory, we could just use that. However, this is not possible, as the kaiming_normal_ function in PyTorch calls torch.nn.init.calculate_gain, which does not accept PReLU as a nonlinearity. Thus, we need to work around this issue.

The alternative is to just calculate our own standard deviation, which is actually easier than I thought. In the paper, they suggest setting the negative slope to whatever value we use to initialise our PReLU; for PyTorch, that would be 0.25 (https://pytorch.org/docs/stable/nn.html#prelu). We also need to calculate the fan mode, for which we can look at how this is done in the PyTorch source (https://pytorch.org/docs/stable/modules/torch/nn/init.html#kaiming_normal):

def _calculate_fan_in_and_fan_out(tensor):
    dimensions = tensor.dim()
    if dimensions < 2:
        raise ValueError("Fan in and fan out can not be computed for tensor with fewer than 2 dimensions")

    num_input_fmaps = tensor.size(1)
    num_output_fmaps = tensor.size(0)
    receptive_field_size = 1
    if tensor.dim() > 2:
        receptive_field_size = tensor[0][0].numel()
    fan_in = num_input_fmaps * receptive_field_size
    fan_out = num_output_fmaps * receptive_field_size

    return fan_in, fan_out
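For a 4-D conv weight of shape (out_channels, in_channels, kH, kW), the code above reduces to fan_in = in_channels * kH * kW and fan_out = out_channels * kH * kW. A quick pure-Python check (the helper is my own shorthand):

```python
def conv_fans(out_channels, in_channels, kh, kw):
    """Fan-in/fan-out of a conv weight of shape (out, in, kh, kw)."""
    receptive_field_size = kh * kw
    return in_channels * receptive_field_size, out_channels * receptive_field_size

fan_in, fan_out = conv_fans(64, 3, 3, 3)  # e.g. nn.Conv2d(3, 64, kernel_size=3)
# fan_in = 27, fan_out = 576
```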

Following the calculation of the std, and knowing we have a mean of 0, we can rewrite the code provided here (Example 10 - https://www.programcreek.com/python/example/107693/torch.nn.PReLU) to produce the required weight initialisation:

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # fan_in = in_channels * kernel_height * kernel_width
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(m.weight)
                negative_slope = 0.25  # default initial slope of nn.PReLU
                m.weight.data.normal_(0, math.sqrt(2. / (fan_in * (1 + negative_slope ** 2))))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

Of course, if one prefers fan_out, they can just swap it in using the code above.

@Nikronic - does what I wrote above make sense? Asking as I think you have more experience with these things than I do.


Hi, sorry for my late answer, I am struggling with final exams! :smiley:

First, thank you for your in-depth explanation. Secondly, I do not have a strong mathematical background, so I do not think I am eligible to validate this, but it sounds fine to me.

Another point I would like to mention is that PyTorch uses a uniform distribution for initialising the weights of conv and linear layers. So if the gain for PReLU is identical to the LeakyReLU one, then to achieve the range [-1/sqrt(fan_mode), 1/sqrt(fan_mode)] for the uniform distribution, we still need to use negative_slope=sqrt(5); otherwise it will lead to a different scenario.

I think we need to discuss this as a feature request so the main developers can help us with it. So, I think it would be a great idea to create an issue on GitHub.
Here is another issue related to this idea that may help. Furthermore, this thread considers another perspective, which I have no clue about :sweat_smile:.

If you create the issue on GitHub, could you please also tag me so I can keep track of things? My username is Nikronic.

Thank you