# nn.Linear default weight initialisation assumes leaky relu activation

In the code for `nn.Linear`, initialisation happens in the `reset_parameters()` method, which calls `init.kaiming_uniform_` (see below):

```python
def reset_parameters(self):
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```
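As a quick sanity check (a sketch with an arbitrary layer size), the net effect of this initialisation can be observed directly: with `a=math.sqrt(5)`, the weight bound works out to `1/sqrt(fan_in)`:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 10)  # fan_in = 100, chosen arbitrarily for this check

# With a=sqrt(5), gain = sqrt(2/(1+5)) = sqrt(1/3), so the uniform bound
# sqrt(3) * gain / sqrt(fan_in) collapses to 1/sqrt(fan_in) = 0.1
bound = 1 / math.sqrt(100)
assert layer.weight.abs().max().item() <= bound
```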


Looking at the `kaiming_uniform_` function definition, it assumes a leaky ReLU activation by default, which affects the gain constant used in the initialisation distribution:

```python
def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    r"""Fills the input Tensor with values according to the method
    described in `Delving deep into rectifiers: Surpassing human-level
    performance on ImageNet classification` - He, K. et al. (2015), using a
    uniform distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{U}(-\text{bound}, \text{bound})` where

    .. math::
        \text{bound} = \sqrt{\frac{6}{(1 + a^2) \times \text{fan\_in}}}

    Also known as He initialization.

    Args:
        tensor: an n-dimensional `torch.Tensor`
        a: the negative slope of the rectifier used after this layer (0 for ReLU
            by default)
        mode: either 'fan_in' (default) or 'fan_out'. Choosing 'fan_in'
            preserves the magnitude of the variance of the weights in the
            forward pass. Choosing 'fan_out' preserves the magnitudes in the
            backwards pass.
        nonlinearity: the non-linear function (`nn.functional` name),
            recommended to use only with 'relu' or 'leaky_relu' (default).

    Examples:
        >>> w = torch.empty(3, 5)
        >>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
    """
    fan = _calculate_correct_fan(tensor, mode)
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
    return tensor.uniform_(-bound, bound)
```
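The docstring's bound formula is easy to verify empirically (this is a sketch with an arbitrary fan-in): calling `kaiming_uniform_` with its defaults (`a=0`, `nonlinearity='leaky_relu'`) should keep every value within `sqrt(6/fan_in)`:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in = 400  # arbitrary fan-in for this check
w = torch.empty(20, fan_in)

nn.init.kaiming_uniform_(w)  # defaults: a=0, mode='fan_in', nonlinearity='leaky_relu'

# bound = sqrt(6 / ((1 + a^2) * fan_in)), which with a=0 reduces to sqrt(6 / fan_in)
expected_bound = math.sqrt(6.0 / fan_in)
assert w.abs().max().item() <= expected_bound
```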


Looking at the `calculate_gain` function, we can confirm that the resulting gain for leaky ReLU differs from that of other activations (including no activation at all):

```python
def calculate_gain(nonlinearity, param=None):
    r"""Return the recommended gain value for the given nonlinearity function.
    The values are as follows:

    ================= ====================================================
    nonlinearity      gain
    ================= ====================================================
    Linear / Identity :math:`1`
    Conv{1,2,3}D      :math:`1`
    Sigmoid           :math:`1`
    Tanh              :math:`\frac{5}{3}`
    ReLU              :math:`\sqrt{2}`
    Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
    ================= ====================================================

    Args:
        nonlinearity: the non-linear function (`nn.functional` name)
        param: optional parameter for the non-linear function

    Examples:
        >>> gain = nn.init.calculate_gain('leaky_relu')
    """
```


Is this desired behaviour? Wouldn’t a more appropriate default activation be linear / identity?

The intent of the original code isn't to assume leaky ReLU; it implements the initialisation from the paper Efficient BackProp (LeCun et al., 1998). It just happens that the initialisation from that paper can also be expressed as a Kaiming draw with a leaky ReLU assumption.
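The equivalence takes only a few lines of arithmetic to confirm (using an arbitrary fan-in for illustration): plugging a = sqrt(5) into the Kaiming bound reproduces LeCun's uniform bound of 1/sqrt(fan_in):

```python
import math

fan_in = 256  # arbitrary fan-in for illustration
a = math.sqrt(5)

# Kaiming uniform bound: sqrt(3) * gain / sqrt(fan_in), with gain = sqrt(2/(1+a^2))
gain = math.sqrt(2.0 / (1 + a ** 2))             # sqrt(2/6) = sqrt(1/3)
kaiming_bound = math.sqrt(3.0) * gain / math.sqrt(fan_in)

# LeCun (Efficient BackProp) uniform bound: 1/sqrt(fan_in)
lecun_bound = 1.0 / math.sqrt(fan_in)

assert math.isclose(kaiming_bound, lecun_bound)
```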

The context is in the original GitHub PR, though in hindsight it wouldn't have hurt to leave a comment in the code; sorry for the misunderstanding.
