In certain situations, I would like to use custom weights in certain layers (e.g., from previously trained models). However, I noticed that if I assign weight and bias values manually, they don’t seem to update during training. Below is a simplified example using normal_ and zero_. If I comment out the manual initialization lines,
the model learns. This suggests there’s something wrong with my approach. Since self.linear_*.weight is already a Parameter instance, I thought overwriting *.weight and *.bias in place would be enough, but that doesn’t seem to be the case.
It would be great if someone could shed some light on this issue!
import torch
import torch.nn.functional as F

class Model(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(Model, self).__init__()
        # num_hidden_1 and num_hidden_2 are assumed to be defined globally
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1.weight.data.normal_(0.0, 0.1)
        self.linear_1.bias.data.zero_()
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2.weight.data.normal_(0.0, 0.1)
        self.linear_2.bias.data.zero_()
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        self.linear_out.weight.data.normal_(0.0, 0.1)
        self.linear_out.bias.data.zero_()

    def forward(self, x):
        out = self.linear_1(x)
        out = F.relu(out)
        out = self.linear_2(out)
        out = F.relu(out)
        out = self.linear_out(out)
        out = F.softmax(out, dim=1)
        return out
torch.manual_seed(0)
model = Model(num_features=num_features,
num_classes=num_classes)
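As an aside, a more common pattern for custom initialization (not from the original post, just a sketch) is to use the torch.nn.init functions inside a torch.no_grad() block, which modifies the registered parameters in place without going through .data:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)

# Re-initialize in place without recording the ops in autograd;
# the tensors stay registered as parameters, so requires_grad is
# preserved and an optimizer will still update them during training.
with torch.no_grad():
    torch.nn.init.normal_(layer.weight, mean=0.0, std=0.1)
    layer.bias.zero_()
```

Either way the parameters remain trainable, so the initialization alone should not freeze them.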
Good point. However, when I remove the custom weight init, the model learns (the weights and biases are updated instead of staying frozen), so I would assume the behavior should be independent of the loss?
Side question: Which loss should then be used with softmax in the last layer?
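For reference, CrossEntropyLoss expects raw logits and applies log_softmax internally, so the softmax in the last layer should be dropped when using it. A minimal check of that equivalence (with made-up dummy tensors):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)           # raw network outputs, no softmax
targets = torch.tensor([0, 2, 1, 1, 0])

# cross_entropy == log_softmax followed by nll_loss, so the model
# should return logits and let the loss handle the normalization.
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```

If you want to keep an explicit log_softmax in forward(), pair it with NLLLoss instead.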
That’s strange. Using dummy data, the weights get updated in both cases (with and without the custom parameter initialization).
What are the gradients after the backward() call?
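One minimal way to inspect this (a sketch with a dummy single-layer model, not the original code) is to run one backward pass and look at each parameter's .grad:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

# Non-None, non-zero gradients mean the parameters are reachable
# from the loss and an optimizer step would update them.
for name, p in model.named_parameters():
    print(name, p.grad.abs().sum().item())
```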
Hm … I cannot reproduce this issue anymore. I think it was related to one or more bugs/misconceptions on my side. E.g., I didn’t know that the ToTensor() transform in torchvision already scales pixel values from [0, 255] to [0, 1], so I normalized the input images twice. The inputs had extremely small values after that (i.e., pixel/255/255), which probably caused the network not to learn, in combination with the suboptimal weight initialization and passing softmax probabilities to CrossEntropyLoss (which expects raw logits). Anyway, thanks for looking into this!