# How are layer weights and biases initialized by default?

No, don’t use the .data attribute as mentioned in this previous post.

Ah ok, but I think it was giving me some sort of error because I was trying to modify the leaf tensors manually to do a custom initialization…

Try wrapping the initialization code in a with torch.no_grad() block, which should resolve this error.
If not, could you post a code snippet to reproduce it?
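
For reference, here is a minimal sketch of the difference the context manager makes. The small nn.Linear model is only a stand-in chosen for illustration, not the model from the question:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 2)  # hypothetical stand-in model

# Outside no_grad, autograd rejects in-place updates of leaf tensors
# that require grad.
try:
    for p in model.parameters():
        p += 0.01
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Inside no_grad, the same in-place update is allowed and is not tracked.
before = [p.detach().clone() for p in model.parameters()]
with torch.no_grad():
    for p in model.parameters():
        p += 0.01

after = list(model.parameters())
print(all(p.requires_grad for p in after))  # parameters stay trainable
```

The parameters keep requires_grad=True afterwards, so training works as usual once the custom values are in place.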

I think that worked! It wasn’t working when I tried w += w + 0.01, but now it does with the torch.no_grad() context you suggested:

    with torch.no_grad():
        # for i in range(len(base_model)):
        for i, w in enumerate(base_model.parameters()):
            print(f'--- i = {i}')
            print(w)
            w += w + 0.001
            print(w)


output:

for i, w in enumerate(base_model.parameters()):
...:     print(f'--- i = {i}')
...:     print(w)
...:
--- i = 0
Parameter containing:
tensor([[0.1010],
[0.1010],
[0.1010],
[0.1010],
[0.1010],
[0.1010],
[0.1010],
[0.1010],
[0.1010],
--- i = 1
Parameter containing:
tensor([0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 2
Parameter containing:
tensor([[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 3
Parameter containing:
tensor([0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 4
Parameter containing:
tensor([[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 5
Parameter containing:
tensor([0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 6
Parameter containing:
tensor([[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
0.1010],
[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 7
Parameter containing:
tensor([0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 8
Parameter containing:
tensor([[0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010, 0.1010,
--- i = 9
Parameter containing:


I thought for a second that it didn’t give me pointers to the actual objects I was trying to modify.

Is it a good idea to initialize the weights directly in the __init__ function by looping over the layers?

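    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            # ...... Defining Layers
            # .......................

            # Runs after the layers above exist, so self.modules() sees them
            for m in self.modules():
                if isinstance(m, nn.Conv3d):
                    # in-place variant; no reassignment of m.weight needed
                    nn.init.kaiming_normal_(m.weight, mode='fan_out')
                elif isinstance(m, (nn.BatchNorm3d, nn.BatchNorm2d)):
                    nn.init.constant_(m.weight, 1)
                    nn.init.constant_(m.bias, 0)

        def forward(self, x):
            ...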


Yes, this should work, as long as all modules have already been created before the loop runs.

You can also use a function like this:

    def initialize_parameters(m):
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight.data, nonlinearity='relu')
            nn.init.constant_(m.bias.data, 0)
        elif isinstance(m, nn.Linear):
            nn.init.xavier_normal_(m.weight.data, gain=nn.init.calculate_gain('relu'))
            nn.init.constant_(m.bias.data, 0)
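
One common way to use a function like this is Module.apply, which calls it on every submodule. A rough sketch follows; the small nn.Sequential model is made up for illustration, and the parameters are passed directly instead of through .data, following the advice earlier in this thread:

```python
import torch
import torch.nn as nn

def initialize_parameters(m):
    # Called once per submodule by model.apply(...)
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight, gain=nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias, 0)

# Hypothetical toy model, sized for 3x8x8 inputs
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 6 * 6, 10),
)
model.apply(initialize_parameters)

print(model[0].bias)  # conv bias after initialization
print(model[3].bias)  # linear bias after initialization
```

Modules that match neither branch (ReLU, Flatten) are simply left untouched.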


Could you clarify why we don’t use the .data attribute anymore with nn.init? The documentation for the init functions still refers to input Tensors rather than Parameters.

(Also, as an aside: I lurk on the Pytorch forums a lot, thank you for all your extremely helpful responses.)

Alban explains it here nicely.

I thought it was init.kaiming_uniform, as per line 87 here?

Kaiming uniform would initialise with variance 2 / fan_in.
However, with a=math.sqrt(5), the initialisation ends up with a variance of 1 / (3 * fan_in), which does not correspond to any standard initialisation scheme.

I refer to my earlier reply for more details…
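
To make the arithmetic behind those two variances explicit, here is a small dependency-free sketch. It mirrors the formulas as I understand them (leaky ReLU gain, std = gain / sqrt(fan_in), bound = sqrt(3) * std); the helper name is made up for this post and is not a PyTorch function:

```python
import math

def kaiming_uniform_variance(fan_in, a=0.0):
    """Variance of U(-bound, bound) as chosen by a Kaiming-uniform-style init.

    Hypothetical helper for illustration only.
    """
    gain = math.sqrt(2.0 / (1.0 + a ** 2))  # leaky ReLU gain with slope a
    std = gain / math.sqrt(fan_in)
    bound = math.sqrt(3.0) * std
    return bound ** 2 / 3.0  # variance of a uniform distribution on [-bound, bound]

fan_in = 512
var_a0 = kaiming_uniform_variance(fan_in, a=0.0)
var_sqrt5 = kaiming_uniform_variance(fan_in, a=math.sqrt(5))

print(var_a0, 2 / fan_in)           # the two values agree
print(var_sqrt5, 1 / (3 * fan_in))  # the two values agree
```

With a = sqrt(5) the gain becomes sqrt(2 / 6), so the bound simplifies to 1 / sqrt(fan_in).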

Hi Ptrblck,

I hope you are well. I want to train my classifier with 10 ensembles. The difference between the ensembles is the order of the subjects that I used for creating the data. I want to be sure that the weight initialization for each ensemble is different from the others. The ensembles run as parallel jobs, independent of each other.

I just define my model and use it directly. Can I be sure that the weight initialization is different for each ensemble, for example randomly different?

Yes, as long as you don’t set the random seed before initializing each module, the parameters will be different. You can check (some or all of) them via print(modelX.my_layer.param), where X denotes the current model; you should see that the same parameter has different values across the models once all of them are initialized.
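
A quick sketch of that check inside a single process, using a made-up make_model helper. It also shows the opposite case, where re-seeding with the same value reproduces the same weights. For separate jobs, one option (an assumption about your setup, not a requirement) is to seed each job explicitly with a distinct value:

```python
import torch
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for the real classifier
    return nn.Linear(4, 2)

# No re-seeding between constructions: the weights differ.
model_a = make_model()
model_b = make_model()
different = not torch.equal(model_a.weight, model_b.weight)

# Same seed before each construction: the weights are identical.
torch.manual_seed(0)
model_c = make_model()
torch.manual_seed(0)
model_d = make_model()
same = torch.equal(model_c.weight, model_d.weight)

print(different, same)
```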

Many thanks. I really appreciate your help.

What difference would it make if we don’t set nonlinearity='relu' when the layer’s nonlinearity is ReLU, given that the default nonlinearity is leaky ReLU?

calculate_gain uses the specified nonlinearity, as seen here:

    Return the recommended gain value for the given nonlinearity function.
    The values are as follows:

    ================= ====================================================
    nonlinearity      gain
    ================= ====================================================
    Linear / Identity :math:`1`
    Conv{1,2,3}D      :math:`1`
    Sigmoid           :math:`1`
    Tanh              :math:`\frac{5}{3}`
    ReLU              :math:`\sqrt{2}`
    Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
    SELU              :math:`\frac{3}{4}`
    ================= ====================================================


which is then used to compute std as seen here.
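
As a rough, dependency-free sketch of how the gain feeds into std: the functions below re-implement a few rows of the gain table above for illustration and are not the library code. As far as I can tell, the kaiming_* functions pass a=0 as the negative slope by default, so the default leaky ReLU gain matches the ReLU gain; please verify this against your installed version:

```python
import math

def gain(nonlinearity, negative_slope=0.0):
    # Illustrative re-implementation of some rows of the gain table
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")

def kaiming_normal_std(fan, nonlinearity, negative_slope=0.0):
    # std = gain / sqrt(fan)
    return gain(nonlinearity, negative_slope) / math.sqrt(fan)

fan_in = 256
std_relu = kaiming_normal_std(fan_in, "relu")
std_leaky_default = kaiming_normal_std(fan_in, "leaky_relu", negative_slope=0.0)
std_leaky_02 = kaiming_normal_std(fan_in, "leaky_relu", negative_slope=0.2)

print(std_relu, std_leaky_default, std_leaky_02)
```

So a difference only appears once a nonzero negative slope is passed, in which case the resulting std is slightly smaller than for ReLU.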

[quote=“Rushirajsinh_Parmar, post:44, topic:13073”]

    isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight.data, gain=nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias.data, 0)


[/quote]
Hi. I assume that using Xavier in linear layers isn’t a good idea. Have you got good results with this init? I ask because of this experiment, which uses the formula from init.py:

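    import math
    import torch

    x = torch.randn(512)
    # Xavier (Glorot) uniform bound for fan_in = fan_out = 512
    std = math.sqrt(2.0 / float(512 + 512))
    a = math.sqrt(3.0) * std

    # 100 stacked linear layers, each followed by ReLU
    for i in range(100):
        y = torch.empty(512, 512).uniform_(-a, a)
        x = torch.relu(x @ y)

    print(x.mean(), x.std())
    # tensor(8.1831e-16) tensor(1.0922e-15)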

These tensors are very small for calculating gradients.
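
For comparison, here is a sketch of the same experiment run twice, once with the Xavier std and once with a Kaiming-style std of sqrt(2 / fan_in). The helper name, the fixed seed, and the choice of the Kaiming formula as the alternative are my own additions for illustration:

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

def final_activation_std(weight_std, width=512, depth=100):
    """Push a random vector through `depth` uniform-initialized ReLU layers."""
    bound = math.sqrt(3.0) * weight_std  # U(-bound, bound) has std weight_std
    x = torch.randn(width)
    for _ in range(depth):
        w = torch.empty(width, width).uniform_(-bound, bound)
        x = torch.relu(x @ w)
    return x.std().item()

width = 512
xavier_std = final_activation_std(math.sqrt(2.0 / (width + width)))
kaiming_std = final_activation_std(math.sqrt(2.0 / width))

print(f"xavier: {xavier_std:.3e}  kaiming: {kaiming_std:.3e}")
```

With the Xavier std, each ReLU layer roughly halves the mean squared activation, which compounds over 100 layers. The factor of 2 in the Kaiming std compensates for that, so the activations stay at a usable scale.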

Hi, I’m not sure about the custom initialization, as I haven’t tried it yet. I’ve been sticking to Xavier initialisation for almost all my applications, but I will give this a try to see how it compares with Xavier!

Hey. This is the implementation of Xavier init from the PyTorch source code. I also read that Xavier works better with symmetric functions like sigmoid or tanh, which is why I ran this experiment (with ReLU). It convinced me that using Xavier without sigmoid or tanh is not a good approach. I’m just sharing what I recently learned.

If you have had better results with Xavier init on a non-symmetric function than without Xavier, let me know; I will be glad to hear it. When I apply a sigmoid function in my little snippet instead, it returns more reliable gradients. You can also check it yourself :)