# Something about linear.py in nn modules

In the official documentation, the weights of the Linear layer are initialized from a uniform distribution with bound `1/sqrt(in_features)`, but I find the formula in linear.py is a different one, as shown in the snippet below. I want to know why `a=math.sqrt(5)` when I choose ReLU as the activation function. Why?
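Roughly, `reset_parameters` in linear.py at that commit reads as follows (paraphrased from the linked source; see the GitHub link further down for the exact code):

```python
def reset_parameters(self):
    # Default weight init: Kaiming uniform with a=sqrt(5)
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```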

The initialization done is the same as in the docs. The implementation simply uses the `kaiming_uniform_` function (torch.nn.init — PyTorch 1.7.0 documentation) and passes the right parameters to get a uniform distribution in (-1/`sqrt(in_features)`, 1/`sqrt(in_features)`). In Kaiming uniform initialization, there is an extra `sqrt(3)` in the bound. So they’ve made the `gain` value 1/`sqrt(3)` using

```python
import math
import torch

# gain = sqrt(2 / (1 + 5)) = 1/sqrt(3)
gain = torch.nn.init.calculate_gain(nonlinearity='leaky_relu', param=math.sqrt(5))
```

Seems like a workaround to avoid writing another function that might duplicate code in `kaiming_uniform_`.

I can understand the operation of initializing weights. What I can’t understand is that the docs (torch.nn.Linear) say the weights of the Linear layer are initialized with bound `1/sqrt(in_features)`, but linear.py uses `kaiming_uniform_`, which you can find at this link: https://github.com/pytorch/pytorch/blob/4ff1823fac8833d4bdfe333c49fd486143175395/torch/nn/modules/linear.py#L35

And the weirdest thing for me is the call `init.kaiming_uniform_(self.weight, a=math.sqrt(5))`, which means the weights of the fully connected layer are initialized by default using the formula corresponding to LeakyReLU. I want to know why it chooses `a=math.sqrt(5)` instead of the negative slope chosen by the user.

Hey. I realise now that my answer wasn’t clear earlier. They’ve used the Kaiming uniform initialisation, yes, but they’ve chosen the parameter `a` such that `gain` is set to `1/sqrt(3)`. The way they’ve done that is using the `calculate_gain` function (pytorch/init.py at 4ff1823fac8833d4bdfe333c49fd486143175395 · pytorch/pytorch · GitHub). With `nonlinearity='leaky_relu'`, the gain factor is `math.sqrt(2.0 / (1 + param ** 2))`. Choose `a = param = math.sqrt(5)`; then the gain factor is `1/sqrt(3)`.
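To see the numbers work out, here is a quick check (the `fan_in` value is just an arbitrary example):

```python
import math

a = math.sqrt(5)
gain = math.sqrt(2.0 / (1 + a ** 2))    # sqrt(2/6) = 1/sqrt(3)
fan_in = 256                            # arbitrary in_features
bound = gain * math.sqrt(3.0 / fan_in)  # the bound used by kaiming_uniform_
print(math.isclose(bound, 1 / math.sqrt(fan_in)))  # True
```

So the extra `sqrt(3)` inside `kaiming_uniform_` cancels, and the weights end up uniform in (-1/`sqrt(in_features)`, 1/`sqrt(in_features)`), exactly as the docs state.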

Maybe I didn’t make my question clear. I have understood the operation of Kaiming initialization. What I don’t understand is why, when I use ReLU as the activation function, the fully connected layer still uses the default scheme `init.kaiming_uniform_(self.weight, a=math.sqrt(5))` to initialize its weights, which is different from its docs. I have also checked resnet.py (vision/resnet.py at 146dd8556c4f3b2ad21fad9312f4602cd049d6da · pytorch/vision · GitHub), in which ReLU is used as the activation function, but the FC layer is initialized by the default scheme, too.
I think the initialization parameter `a` should be set to `0` when ReLU is used as the activation function, but now it is `sqrt(5)`, which is the value you would use for LeakyReLU; see the sketch below for the difference.
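For comparison, here is what the two choices of `a` give as the uniform bound inside `kaiming_uniform_` (a quick sketch; `fan_in` is arbitrary):

```python
import math

def kaiming_uniform_bound(a, fan_in):
    gain = math.sqrt(2.0 / (1 + a ** 2))  # leaky_relu gain formula
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 256
print(kaiming_uniform_bound(0, fan_in))             # sqrt(6/fan_in), the ReLU-style bound
print(kaiming_uniform_bound(math.sqrt(5), fan_in))  # 1/sqrt(fan_in), the documented bound
```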

`init.kaiming_uniform_(self.weight, a=math.sqrt(5))` in linear.py. Check it at: pytorch/linear.py at 4ff1823fac8833d4bdfe333c49fd486143175395 · pytorch/pytorch · GitHub

The Linear class has no knowledge about activations, so this `reset_parameters` function had to be universal. It is also invoked from the constructor, thus it has no arguments; you’re supposed to manually re-initialize the parameters after construction if the defaults don’t work well.
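For example, if your network uses ReLU, you can override the default after construction (a minimal sketch; the layer sizes are arbitrary):

```python
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(128, 64)
# Re-initialize with the ReLU gain instead of the default a=math.sqrt(5)
init.kaiming_uniform_(layer.weight, nonlinearity='relu')
init.zeros_(layer.bias)
```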

As @googlebot mentions, this initialization is universal, invoked in the constructor of any linear layer. The `nonlinearity` argument passed to the `kaiming_uniform_` function is not based on the nonlinearity that follows the Linear layer. The whole idea is to use the `kaiming_uniform_` function to get the uniform initialization we want, by passing the right set of arguments (`a` and `nonlinearity`).
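You can also verify empirically that the default initialization lands in the documented range (a quick sanity check; the layer sizes are arbitrary):

```python
import math
import torch.nn as nn

layer = nn.Linear(100, 10)
bound = 1 / math.sqrt(100)  # documented bound: 1/sqrt(in_features) = 0.1
assert layer.weight.abs().max().item() <= bound
```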

You can maybe think about this as PyTorch devs trying to avoid writing any code that duplicates statements in `kaiming_uniform_`.

Thank you very much for your patient replies. I understand it now.