How are layer weights and biases initialized by default?

Not for each layer, but just the layer type.
I would use it, as model.apply() will call the function on all modules, i.e. also on the model itself.
If you have modules without parameters, like nn.ReLU, you'll get an error unless you guard the init function with a type check.
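
Something along these lines, for example (a minimal sketch; the layer types in the isinstance check are just examples):

import torch.nn as nn

def weights_init(m):
    # only touch layer types that actually have weight/bias parameters;
    # parameter-less modules (nn.ReLU, nn.Dropout) and the parent model are skipped
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
model.apply(weights_init)  # calls weights_init on every submodule and on model itself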


torch.nn.init.zero_() does exist, but it never appears in the docs… :joy:

Has this answer changed? Looking at the current linear.py, we see:

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

Where does the sqrt(5) come from? According to the source it’s supposed to be the “negative slope of the activation function used after this layer”, which makes no sense because sqrt(5) is positive…

EDIT: upon looking at it again, it seems to be for leaky ReLU. Still, sqrt(5) is positive, and I think most models still use plain ReLU, so I’m not sure why this is the default initialization…
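
For what it’s worth, plugging a = sqrt(5) into the leaky-ReLU gain formula that kaiming_uniform_ uses shows what the value actually does (a quick check based on the formulas in torch.nn.init; fan_in is just an example):

import math

a = math.sqrt(5)
fan_in = 128                            # example fan-in of a linear layer
gain = math.sqrt(2.0 / (1 + a ** 2))    # leaky_relu gain -> sqrt(1/3) ~ 0.577
std = gain / math.sqrt(fan_in)
bound = math.sqrt(3.0) * std            # the sqrt(3) cancels the sqrt(1/3) from the gain
print(bound, 1 / math.sqrt(fan_in))     # both ~ 0.088

So the bound collapses to 1 / sqrt(fan_in), which suggests a = sqrt(5) was picked to reproduce the old U(-1/sqrt(fan_in), 1/sqrt(fan_in)) default rather than to match a real leaky-ReLU slope.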

Hi @ptrblck!

Where in the program should I add the weights_init() function that you’ve defined? Just after instantiating my model class?
e.g. if I have a class Net() with my network architecture, then in the main part can I do the following?

model = Net() 
def weights_init(m):
    if isinstance(m, nn.Conv2d):
        xavier(m.weight.data)
        xavier(m.bias.data)

model.apply(weights_init)
optimizer=...
criterion=...

# Then the training Loop...

Is this a good way to initialize the weights and then train my model?

Thanks


Yes, it looks good.
Some small side notes: .data shouldn’t be used anymore, so just use the in-place init methods and pass the parameters directly:

def weights_init(m):
    if isinstance(m, nn.Conv2d):
        torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.zeros_(m.bias)

model.apply(weights_init)

Excellent!

Thank you for your quick answer!

Thanks.

This post might be useful for some as well:
https://stackoverflow.com/a/49433937/5609823


What is the difference between using a weights_init() function and not using one? I have seen many simple models without it, so I am wondering when we should use such a function to initialize the weights of the layers.
Thank you

Why Initialize Weights

The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all.
source
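
To make that concrete, here is a small toy experiment (my own sketch, not from the linked article): a deep ReLU MLP where the activation scale collapses, blows up, or stays stable depending only on the init:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 512)

def final_activation_std(init_fn, depth=50, width=512):
    # push the input through `depth` freshly initialized Linear+ReLU layers
    h = x
    with torch.no_grad():
        for _ in range(depth):
            lin = nn.Linear(width, width)
            init_fn(lin.weight)
            nn.init.zeros_(lin.bias)
            h = torch.relu(lin(h))
    return h.std().item()

print(final_activation_std(lambda w: nn.init.normal_(w, std=0.01)))  # ~0: activations vanish
print(final_activation_std(lambda w: nn.init.normal_(w, std=0.1)))   # orders of magnitude larger: activations explode
print(final_activation_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu')))  # stays around the input scale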


Hi,
I was looking into the default weight initialization code for the convolutional layer.
It looks a bit complicated, so a simpler explanation would be really helpful :slight_smile:

I assume you are referring to reset_parameters().
Do you have a question regarding the weight or bias initialization?

Yes. reset_parameters() basically suggests that by default PyTorch uses Kaiming initialization for the weights. Kindly let me know if my understanding is correct.
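
For reference, this is roughly what I see in the source for nn.Linear / nn.Conv2d (paraphrased, so it may look slightly different in your installed version):

import math
import torch.nn.init as init

def reset_parameters(self):
    # "Kaiming uniform" with a=sqrt(5); because of the leaky_relu gain this
    # reduces to U(-1/sqrt(fan_in), 1/sqrt(fan_in)) for the weights
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)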


One answer in this older thread suggests that the initialisation resembles what is referred to as “LeCun initialisation”. This comment is probably long overdue, but PyTorch’s default initialisation for the Linear module is neither LeCun nor He/Kaiming initialisation.

If we go through the code (v1.5.0) of Linear.reset_parameters, the first line initialises the weight matrix:
init.kaiming_uniform_(self.weight, a=math.sqrt(5)). If we take a look at how kaiming_uniform is implemented, we find that this line is equivalent to

fan = tensor.size(1)  # fan-in for linear, as computed by _calculate_correct_fan
gain = math.sqrt(2.0 / (1 + a ** 2))  # gain, as computed by calculate_gain
std = gain / math.sqrt(fan)
bound = math.sqrt(3.0) * std
with torch.no_grad():
    return tensor.uniform_(-bound, bound)

Since a = math.sqrt(5) the weights are initialised with std = 1 / math.sqrt(3.0 * fan_in). For reference, LeCun initialisation would be 1 / math.sqrt(fan_in) and He initialisation uses math.sqrt(2 / fan_in).

The bias initialisation in Linear.reset_parameters reveals another problem. Although biases are normally initialised with zeros (for the sake of simplicity), the idea is probably to initialise the biases with std = 1 / math.sqrt(fan_in) (cf. LeCun init). By using this value for the boundaries of the uniform distribution, the resulting distribution has std 1 / math.sqrt(3.0 * fan_in), which happens to be the same as the standard deviation of the weights.

A more reasonable default for me would be to use LeCun initialisation (since this has been the go-to standard since 1998). I could also understand Kaiming initialisation as the default, because everyone is using ReLU activation functions everywhere anyway (although I have a feeling that this is not necessarily the case for people working with fully connected networks). Some time ago, I submitted a pull request to adopt LeCun initialisation as the default, but there seems to be little incentive to actually make changes due to backward compatibility.

This probably also explains why pytorch ended up with its own initialisation strategy for fully connected networks. Someone must have forgotten about the fact that a uniform distribution with bounds -b, b has a standard deviation of b / math.sqrt(3) instead of just b. Due to backwards compatibility this got stuck and no-one is willing to make the change to the more widely accepted and standard initialisation.
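
A quick empirical check of the above (numbers from a recent PyTorch; the measured std fluctuates a little between runs):

import math
import torch.nn as nn

fan_in = 1024
lin = nn.Linear(fan_in, 256)

print(lin.weight.std().item())     # empirical std of the default init, ~0.018
print(1 / math.sqrt(3 * fan_in))   # predicted 1 / sqrt(3 * fan_in) ~ 0.0180
print(1 / math.sqrt(fan_in))       # LeCun init would give ~0.0312
print(math.sqrt(2 / fan_in))       # He/Kaiming init would give ~0.0442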


Hello,
class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_shape, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_normal_(tensor, gain=1.0)
            m.bias.data.fill_(0.01)

    def forward(self, x):
        return self.fc(x).apply(init_weights)

While using this architecture and weight-initialization technique, I am getting this error:

    def forward(self, x):
        return self.fc(x).apply(init_weights)
AttributeError: 'Tensor' object has no attribute 'apply'

Can somebody help me with this?

model = DQN()
model.apply(init_weights)
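
i.e. define init_weights as a standalone function (don’t call it in forward) and apply it once right after constructing the model. A minimal sketch (the constructor arguments are placeholders):

import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight, gain=1.0)
        nn.init.constant_(m.bias, 0.01)

model = DQN(input_shape=8, n_actions=4)  # placeholder shapes
model.apply(init_weights)                # recursively applies init_weights to all submodules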

Is the .data field something we should still use/modify in 2020? Perhaps I’m misremembering.

No, don’t use the .data attribute as mentioned in this previous post.


Ah ok, but I think it was giving me some sort of error because I was trying to modify the leaf tensors manually to do a custom initialization…

Try to wrap the initialization code in a with torch.no_grad() block, which should resolve this error.
If not, could you post a code snippet to reproduce it?
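
e.g. something like this (a minimal sketch of what I mean; replace the assignment with your custom init):

import torch
import torch.nn as nn

layer = nn.Linear(10, 10)
with torch.no_grad():
    # modify the parameter in-place without going through .data
    layer.weight.copy_(torch.full_like(layer.weight, 0.1))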


I think that worked! It wasn’t working when I tried w += w + 0.01, but now it works with the torch.no_grad() context you suggested:

        with torch.no_grad():
            #for i in range(len(base_model)):
            for i, w in enumerate(base_model.parameters()):
                print(f'--- i = {i}')
                print(w)
                w += w + 0.001
                print(w)

output:

for i, w in enumerate(base_model.parameters()):
  ...:     print(f'--- i = {i}')
  ...:     print(w)
  ...:     
--- i = 0
Parameter containing:
tensor([[0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010],
        [0.1010]], requires_grad=True)
--- i = 1
Parameter containing:
tensor([0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
        0.1010], requires_grad=True)
--- i = 2 … 7
(the hidden layers all look the same: 10×10 weight matrices and length-10 bias vectors, every entry 0.1010, requires_grad=True)
--- i = 8
Parameter containing:
tensor([[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
         0.1010]], requires_grad=True)
--- i = 9
Parameter containing:
tensor([0.1010], requires_grad=True)

I thought for a second that it didn’t give me pointers to the actual objects I was trying to modify.