# Initialize the weights in layers -- Which method is most recommended?

Dear experienced friends,

These days I have been roaming around our PyTorch Forums trying to find a way to initialize a weight matrix, and I found several ways to achieve it. May I ask which one you would recommend most?

Suppose we have a very simple (but typical) neural network, and our goal is to initialize the weights of the first `conv1` layer to `[[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]]` (a `3x3` filter).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        return x
```

So here are several ways we can initialize the weights (huge respect to vmirly1, ptrblck, et al.):

• Method 1: Define the custom weight matrix inside `__init__`:
```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, 3)
        self.pool = nn.MaxPool2d(2, 2)

        K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])  # define the custom kernel
        K = torch.unsqueeze(torch.unsqueeze(K, 0), 0)         # reshape to (1, 1, 3, 3)
        self.conv1.weight.data = self.conv1.weight.data * 0 + K  # broadcast K into conv1's weight

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        return x
```

• Method 2: Define the weights after you build the instance (before training, of course):
```python
net = Net()

# then change the weights outside the class
K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])
K = torch.unsqueeze(torch.unsqueeze(K, 0), 0)  # reshape to (1, 1, 3, 3)
net.conv1.weight.data = net.conv1.weight.data * 0 + K  # note: net.conv1, not net.conv1[0]
```

• Method 3: Use a saved `state_dict` to update the weights (a sketch follows below).
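Since no code was posted for this one, here is a minimal sketch of how I understand it, assuming the `Net` from Method 1 (with `conv1 = nn.Conv2d(1, 3, 3)`): edit a copy of the state dict, then load it back.

```python
net = Net()
state_dict = net.state_dict()

K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])
# conv1.weight has shape (out_channels, in_channels, kH, kW) = (3, 1, 3, 3),
# so repeat the 3x3 kernel across all 3 output channels.
state_dict['conv1.weight'] = K.unsqueeze(0).unsqueeze(0).repeat(3, 1, 1, 1)

net.load_state_dict(state_dict)  # shapes must match the model exactly
```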

• Method 4: Use a class method to achieve it (from the tutorials):
```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.init_weights()

    def init_weights(self):
        K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])
        K = torch.unsqueeze(torch.unsqueeze(K, 0), 0)  # reshape to (1, 1, 3, 3)
        # broadcast so the weight keeps its (3, 1, 3, 3) shape
        self.conv1.weight.data = self.conv1.weight.data * 0 + K

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        return x
```

I think these are all the methods I could find online. May I ask which one is most recommended? Or are any of them risky?

I personally prefer Method 4 because it can be really convenient when, for example, you want to initialize the weights of multiple layers. Here's one example of when it can be helpful:

```python
def initialize_weights(self):
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
```
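You would then call it once after building the model (this assumes `initialize_weights` is defined as a method on the `Net` class above):

```python
net = Net()
net.initialize_weights()  # walks self.modules() and re-initializes every matching layer
```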

I also like @Superklez's approach for the same reason. In case you just want to assign known values to a single layer, I would probably use "Method 1".

While your approaches would work fine, I would not recommend using the `.data` attribute in any of them, as it might yield unwanted side effects. You could assign a new `nn.Parameter` to the `weight` attribute directly (wrapping it in a `with torch.no_grad()` block if necessary), use the `nn.init` methods as seen in @Superklez's code, or use the `.copy_` method in case you want to assign the values directly to a parameter.
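For example, a minimal sketch of the `.copy_` variant, assuming the `Net` from Method 1 and the same kernel `K`:

```python
net = Net()
K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])

with torch.no_grad():          # keep the assignment out of the autograd graph
    net.conv1.weight.copy_(K)  # (3, 3) broadcasts across the (3, 1, 3, 3) weight
```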


Thank you for your suggestion, Superklez. Your example clearly shows how to use control flow to initialize different kinds of layers. Really helpful!

Hi ptrblck, thank you for the explanation. I just tested the method you mentioned, and it works great:

```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, 3)
        self.pool = nn.MaxPool2d(2, 2)

        # now assign the parameters
        K = torch.tensor([[0.,0.,0.],[1.,1.,1.],[2.,2.,2.]])
        K = torch.unsqueeze(torch.unsqueeze(K, 0), 0)  # shape (1, 1, 3, 3)

        # repeat across the 3 output channels so the shape stays (3, 1, 3, 3)
        self.conv1.weight = nn.Parameter(K.repeat(3, 1, 1, 1))

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        return x

net = Net()
```

However, I am quite confused by `torch.no_grad()` here. As mentioned in the documentation, inside a `with torch.no_grad()` block gradient calculation is disabled and new results have `requires_grad=False`. Nevertheless, when I print out the parameters, all of them are still trainable. May I ask why this happens?

```python
for param in net.parameters():
    print(param)
```

```
Parameter containing:
tensor([[[[0., 0., 0.],
          [1., 1., 1.],
          ...
```

`torch.no_grad()` will make sure that the operations inside the block are not tracked by Autograd and thus not recorded in the computation graph (as you don't want to backpropagate through the parameter assignment).
The `nn.Parameter` itself should keep its `requires_grad=True` attribute.
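A quick check to illustrate this, using the `net` from your snippet above: the in-place update is untracked, but the flag on the parameter is unchanged.

```python
with torch.no_grad():
    net.conv1.weight.fill_(1.0)        # in-place op, not recorded by autograd

print(net.conv1.weight.requires_grad)  # True -> the optimizer will still update it
```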