How are layer weights and biases initialized by default?

Is it a good idea to initialize weights itself in the init function by looping over the layers?

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # ...... Defining Layers
        # .......................

        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
            elif isinstance(m, (nn.BatchNorm3d, nn.BatchNorm2d)):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def forward(self, x):
        ...

Yes, this should work, as long as all modules were already created before the loop runs (i.e. the layers are defined first in __init__).

You can also use a function like this:

def initialize_parameters(m):
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight.data, nonlinearity = 'relu')
        nn.init.constant_(m.bias.data, 0)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight.data, gain = nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias.data, 0)
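
For completeness, a minimal usage sketch (the model below is just a placeholder) showing how such a function is typically applied recursively via model.apply:

import torch.nn as nn

# Placeholder model; any nn.Module works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# apply() calls initialize_parameters on every submodule recursively.
model.apply(initialize_parameters)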

Could you clarify why we don't use the .data attribute anymore with nn.init? The documentation for the init functions still refers to input Tensors rather than Parameters.

(Also, as an aside: I lurk on the PyTorch forums a lot, thank you for all your extremely helpful responses.)

Alban explains it here nicely.

I thought it was init.kaiming_uniform as per line 87 here?

Kaiming uniform would initialise with variance 2 / fan_in.
However, with a=math.sqrt(5), the initialisation ends up with a variance 1 / (3 * fan_in), which does not correspond to any standard initialisation scheme.
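
A small sketch (the fan_in of 512 is just an example) reproducing this arithmetic with calculate_gain and kaiming_uniform_:

import math
import torch
import torch.nn as nn

fan_in = 512
gain = nn.init.calculate_gain('leaky_relu', math.sqrt(5))  # sqrt(2 / (1 + 5)) = sqrt(1/3)
std = gain / math.sqrt(fan_in)                             # sqrt(1 / (3 * fan_in))
print(std ** 2, 1.0 / (3 * fan_in))                        # both ~6.5e-4

# The same std is used by the default kaiming_uniform_ call:
w = torch.empty(256, fan_in)
nn.init.kaiming_uniform_(w, a=math.sqrt(5))
print(w.var().item())                                      # close to 1 / (3 * fan_in)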

I refer to my earlier reply for more details…

Hi Ptrblck,

I hope you are well. Sorry, I want to train my classifier with 10 ensembles. The difference between the ensembles is the order of the subjects I used for creating the data. I want to be sure that for each ensemble the weight initialization is different from the others. The ensembles run in parallel jobs, independent of each other.

I just define my model and use it directly. Can I be sure that for each ensemble the weight initialization is different, i.e. randomly different?

Yes, as long as you don't set the random seed before initializing each module, the parameters will be different. You can check (some or all) of them via print(modelX.my_layer.param), where X denotes the current model, and you will see that the same parameter has different values after initializing all models.
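
A quick sketch to verify this (the layer and shapes are placeholders):

import torch
import torch.nn as nn

# No manual seed is set, so each construction draws fresh random values.
model0 = nn.Linear(10, 10)
model1 = nn.Linear(10, 10)

print(torch.equal(model0.weight, model1.weight))  # False: different initializations
print(model0.weight[0, :3])
print(model1.weight[0, :3])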


Many thanks. I really appreciate your help.

What difference would it make if we don't set nonlinearity='relu' while the layer's nonlinearity is ReLU, given that the default nonlinearity is leaky_relu?

calculate_gain uses the specified nonlinearity, as seen here:

Return the recommended gain value for the given nonlinearity function.
    The values are as follows:
    ================= ====================================================
    nonlinearity      gain
    ================= ====================================================
    Linear / Identity :math:`1`
    Conv{1,2,3}D      :math:`1`
    Sigmoid           :math:`1`
    Tanh              :math:`\frac{5}{3}`
    ReLU              :math:`\sqrt{2}`
    Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
    SELU              :math:`\frac{3}{4}`
    ================= ====================================================

which is then used to compute std as seen here.
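
To make it concrete, a small sketch (shapes picked arbitrarily) comparing the gains and the resulting std for both settings:

import torch
import torch.nn as nn

# Gains used internally by the kaiming_* functions.
print(nn.init.calculate_gain('relu'))           # sqrt(2) ~ 1.4142
print(nn.init.calculate_gain('leaky_relu', 0))  # also sqrt(2): negative_slope 0 reduces to ReLU

w1 = torch.empty(128, 64)
w2 = torch.empty(128, 64)
nn.init.kaiming_normal_(w1, nonlinearity='relu')
nn.init.kaiming_normal_(w2)  # default nonlinearity='leaky_relu' with a=0
print(w1.std().item(), w2.std().item())  # both ~ sqrt(2 / 64) ~ 0.177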


[quote="Rushirajsinh_Parmar, post:44, topic:13073"]

isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight.data, gain = nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias.data, 0)

[/quote]

Hi. I'd assume that using Xavier init in linear layers isn't a good idea. Have you gotten good results with this init? Because of the following behaviour (using the bounds from init.py):

import math
import torch

# Xavier-uniform bound for a 512 -> 512 layer, as computed in init.py.
std = math.sqrt(2.0 / float(512 + 512))
a = math.sqrt(3.0) * std

x = torch.randn(512)
for i in range(100):
    y = torch.empty(512, 512).uniform_(-a, a)
    x = torch.relu(x @ y)

print(x.mean(), x.std())
# tensor(8.1831e-16) tensor(1.0922e-15)

These tensors are far too small to produce useful gradients.


Hi, I'm not sure about the custom initialization since I haven't tried it yet. I've been sticking to Xavier initialisation for almost all applications, but I will give this a try to see how it compares with Xavier!


Hey :wave:. This is the implementation of Xavier init from the PyTorch source code. I also read that Xavier init works better with symmetric activation functions like sigmoid or tanh, which is why I ran the experiment above with ReLU and became confident that using Xavier init without sigmoid or tanh is not a good approach. Just sharing what I recently learned with you.

If you got better results with Xavier init combined with a non-symmetric activation function, let me know, I'd be glad to hear about it. If I apply a sigmoid function in my little code snippet above, it returns much more reliable gradients; you can check it yourself.
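
For reference, a minimal sketch of the sigmoid variant described above (same Xavier-style bound as the earlier snippet):

import math
import torch

std = math.sqrt(2.0 / float(512 + 512))
a = math.sqrt(3.0) * std

x = torch.randn(512)
for _ in range(100):
    y = torch.empty(512, 512).uniform_(-a, a)
    x = torch.sigmoid(x @ y)

print(x.mean(), x.std())  # activations stay in (0, 1) instead of collapsing to ~1e-15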

Your encoder_net works fine for inputs in the shape [batch_size, 3, 32, 32]:

encoder_net = nn.Sequential(
              nn.Conv2d(3, 64, 4, stride=2, padding=0),    # [batch, 64, 15, 15]
              nn.ReLU(),
              nn.Conv2d(64, 128, 4, stride=2, padding=0),  # [batch, 128, 6, 6]
              nn.ReLU(),
              nn.Conv2d(128, 512, 4, stride=2, padding=0), # [batch, 512, 2, 2]
              nn.ReLU(),
              # nn.Conv2d(256, 512, 2, stride=2, padding=0),
              # nn.ReLU(),
              nn.Flatten(),
              nn.Linear(2048, 32)  # 512 * 2 * 2 = 2048
)

x = torch.randn(16, 3, 32, 32)
out = encoder_net(x)
print(out.shape)
# torch.Size([16, 32])

I'm unsure where the error is coming from, but could you try to slim down the code a bit and check which model fails with which input shapes?
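
If it helps, a small debugging sketch (not from your original code; it assumes the encoder_net defined above) that prints the output shape after each layer of the nn.Sequential, so you can see exactly where the shapes stop matching:

x = torch.randn(16, 3, 32, 32)  # replace with the failing input shape
out = x
for idx, layer in enumerate(encoder_net):
    out = layer(out)
    print(idx, layer.__class__.__name__, out.shape)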