# Something about linear.py in nn modules

In the official documentation, the weights of the Linear layer are initialized from a uniform distribution with bound `1/sqrt(in_features)`, but I find the formula in linear.py is a different one, as shown in the snippet below. I want to know why `a=math.sqrt(5)` when I choose ReLU as the activation function. Why?
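Roughly, `reset_parameters` in linear.py at that commit reads as follows (paraphrased from the linked source; see the GitHub link further down for the exact code):

```python
def reset_parameters(self):
    # Default weight init: Kaiming uniform with a=sqrt(5)
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```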

The initialization done is the same as in the docs. The implementation simply uses the `kaiming_uniform_` function (torch.nn.init — PyTorch 1.7.0 documentation) and passes the right parameters to get a uniform distribution in (-1/`sqrt(in_features)`, 1/`sqrt(in_features)`). In Kaiming uniform initialization, there is an extra `sqrt(3)` in the bound. So they’ve made the `gain` value 1/`sqrt(3)` using

```python
import math
import torch

# gain = sqrt(2 / (1 + 5)) = 1/sqrt(3)
gain = torch.nn.init.calculate_gain(nonlinearity='leaky_relu', param=math.sqrt(5))
```

Seems like a workaround to avoid writing another function that might duplicate code in `kaiming_uniform_`.

I can understand the operation of initializing weights. What I can’t understand is that the docs (torch.nn.Linear) say the weights of the Linear layer are initialized with bound `1/sqrt(in_features)`, but linear.py uses `kaiming_uniform_`, which you can find at this link: https://github.com/pytorch/pytorch/blob/4ff1823fac8833d4bdfe333c49fd486143175395/torch/nn/modules/linear.py#L35

And the weirdest thing for me is the call `init.kaiming_uniform_(self.weight, a=math.sqrt(5))`, which means the weights of the fully connected layer are initialized by default using the formula corresponding to LeakyReLU. I want to know why it chooses `a=math.sqrt(5)` instead of the negative slope chosen by the user.

Hey. I realise now that my answer wasn’t clear earlier. They’ve used the Kaiming uniform initialisation, yes, but they’ve chosen the parameter `a` such that `gain` is set to `1/sqrt(3)`. The way they’ve done that is using the `calculate_gain` function (pytorch/init.py at 4ff1823fac8833d4bdfe333c49fd486143175395 · pytorch/pytorch · GitHub). With `nonlinearity='leaky_relu'`, the gain factor is `math.sqrt(2.0 / (1 + param ** 2))`. Choose `a = param = math.sqrt(5)`; then the gain factor is `1/sqrt(3)`.
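To see the numbers work out, here is a quick check (the `fan_in` value is just an arbitrary example):

```python
import math

a = math.sqrt(5)
gain = math.sqrt(2.0 / (1 + a ** 2))    # sqrt(2/6) = 1/sqrt(3)
fan_in = 256                            # arbitrary in_features
bound = gain * math.sqrt(3.0 / fan_in)  # the bound used by kaiming_uniform_
print(math.isclose(bound, 1 / math.sqrt(fan_in)))  # True
```

So the extra `sqrt(3)` inside `kaiming_uniform_` cancels, and the weights end up uniform in (-1/`sqrt(in_features)`, 1/`sqrt(in_features)`), exactly as the docs state.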

Maybe I didn’t make my question clear. I have understood the operation of Kaiming initialization. What I don’t understand is why, when I use ReLU as the activation function, the fully connected layer still uses the default scheme `init.kaiming_uniform_(self.weight, a=math.sqrt(5))` to initialize its weights, which is different from its docs. I have also checked resnet.py (vision/resnet.py at 146dd8556c4f3b2ad21fad9312f4602cd049d6da · pytorch/vision · GitHub), in which ReLU is used as the activation function, but the FC layer is initialized by the default scheme, too.
I think the initialization parameter `a` should be set to `0` when ReLU is used as the activation function, but now it is `sqrt(5)`, which is the value you would use for LeakyReLU; see the sketch below for the difference.
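For comparison, here is what the two choices of `a` give as the uniform bound inside `kaiming_uniform_` (a quick sketch; `fan_in` is arbitrary):

```python
import math

def kaiming_uniform_bound(a, fan_in):
    gain = math.sqrt(2.0 / (1 + a ** 2))  # leaky_relu gain formula
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 256
print(kaiming_uniform_bound(0, fan_in))             # sqrt(6/fan_in), the ReLU-style bound
print(kaiming_uniform_bound(math.sqrt(5), fan_in))  # 1/sqrt(fan_in), the documented bound
```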

`init.kaiming_uniform_(self.weight, a=math.sqrt(5))` in linear.py. Check it at: pytorch/linear.py at 4ff1823fac8833d4bdfe333c49fd486143175395 · pytorch/pytorch · GitHub

The Linear class has no knowledge about activations, so this `reset_parameters` function had to be universal. It is also invoked from the constructor, thus it has no arguments; you’re supposed to manually re-initialize the parameters after construction if the defaults don’t work well.
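For example, if your network uses ReLU, you can override the default after construction (a minimal sketch; the layer sizes are arbitrary):

```python
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(128, 64)
# Re-initialize with the ReLU gain instead of the default a=math.sqrt(5)
init.kaiming_uniform_(layer.weight, nonlinearity='relu')
init.zeros_(layer.bias)
```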

As @googlebot mentions, this initialization is universal, invoked in the constructor of any linear layer. The `nonlinearity` argument passed to the `kaiming_uniform_` function is not based on the nonlinearity that follows the Linear layer. The whole idea is to use the `kaiming_uniform_` function to get the uniform initialization we want, by passing the right set of arguments (`a` and `nonlinearity`).
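You can also verify empirically that the default initialization lands in the documented range (a quick sanity check; the layer sizes are arbitrary):

```python
import math
import torch.nn as nn

layer = nn.Linear(100, 10)
bound = 1 / math.sqrt(100)  # documented bound: 1/sqrt(in_features) = 0.1
assert layer.weight.abs().max().item() <= bound
```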

You can maybe think about this as PyTorch devs trying to avoid writing any code that duplicates statements in `kaiming_uniform_`.

Thank you very much for your patient replies. I understand it now.