In certain situations, I would like to use custom weights in certain layers (e.g., from previously trained models). However, I noticed that if I assign weight and bias values manually, they don’t seem to update during training. Below is a simplified example using normal_ and zero_. If I comment out the manual initialization lines,
the model learns. This suggests there’s something wrong with my approach. Since self.linear_*.weight is already a Parameter instance, I thought overwriting *.weight and *.bias in place would be enough, but that doesn’t seem to be the case.
It would be great if someone could shed some light on this issue!
import torch
import torch.nn.functional as F

class Model(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(Model, self).__init__()
        # num_hidden_1 and num_hidden_2 are assumed to be defined globally
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1.weight.data.normal_(0.0, 0.1)
        self.linear_1.bias.data.zero_()
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2.weight.data.normal_(0.0, 0.1)
        self.linear_2.bias.data.zero_()
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        self.linear_out.weight.data.normal_(0.0, 0.1)
        self.linear_out.bias.data.zero_()

    def forward(self, x):
        out = self.linear_1(x)
        out = F.relu(out)
        out = self.linear_2(out)
        out = F.relu(out)
        out = self.linear_out(out)
        out = F.softmax(out, dim=1)
        return out
torch.manual_seed(0)
model = Model(num_features=num_features,
num_classes=num_classes)
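As an aside, a more common pattern for custom initialization (not from the original post, just a sketch) is to use the torch.nn.init functions inside a torch.no_grad() block, which modifies the registered parameters in place without going through .data:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)

# Re-initialize in place without recording the ops in autograd;
# the tensors stay registered as parameters, so requires_grad is
# preserved and an optimizer will still update them during training.
with torch.no_grad():
    torch.nn.init.normal_(layer.weight, mean=0.0, std=0.1)
    layer.bias.zero_()
```

Either way the parameters remain trainable, so the initialization alone should not freeze them.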
Good point. However, when I remove the custom weight init, the model learns (the weights and biases are updated instead of staying frozen), so I would assume the behavior should be independent of the loss?
Side question: Which loss should then be used with softmax in the last layer?
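For reference, CrossEntropyLoss expects raw logits and applies log_softmax internally, so the softmax in the last layer should be dropped when using it. A minimal check of that equivalence (with made-up dummy tensors):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)           # raw network outputs, no softmax
targets = torch.tensor([0, 2, 1, 1, 0])

# cross_entropy == log_softmax followed by nll_loss, so the model
# should return logits and let the loss handle the normalization.
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```

If you want to keep an explicit log_softmax in forward(), pair it with NLLLoss instead.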
That’s strange. Using dummy data, the weights get updated in both cases (with and without the custom parameter initialization).
What are the gradients after the backward() call?
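One minimal way to inspect this (a sketch with a dummy single-layer model, not the original code) is to run one backward pass and look at each parameter's .grad:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

# Non-None, non-zero gradients mean the parameters are reachable
# from the loss and an optimizer step would update them.
for name, p in model.named_parameters():
    print(name, p.grad.abs().sum().item())
```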
Hm … I cannot reproduce this issue anymore. I think it was related to one or more bugs/misconceptions on my side. E.g., I didn’t know that the ToTensor() transform in torchvision already scales pixel values from [0, 255] to [0, 1], so I normalized the input images twice. The inputs had extremely small values after that (i.e., pixel/255/255), which probably caused the network not to learn, in combination with the suboptimal weight initialization and passing softmax probabilities to CrossEntropyLoss (which expects raw logits). Anyway, thanks for looking into this!