I pretrain a neural network on Dataset A and then finetune it on Dataset B. Before finetuning, I add a dense layer on top of the model (red arrow) that I would like to regularise. However, the regularisation does not seem to work properly: either performance drops sharply even with tiny regularisation weights (with weights in the 0.01 - 0.08 range, F1 drops from around 22% to 12% on the dev set), or I get exactly the same F1 score for all weights, as if I had trained the same model several times with no changes. <-- this does not occur anymore
Reg. matrix initialisation:

```python
torch.cat((torch.eye(42), torch.zeros(42, 200)), dim=1).cuda()
```
Reg. term (`adapt` is the dense layer whose weight W is regularised):

```python
squared_sum = torch.sub(self.model.adapt.weight, self.reg_matrix).pow(2).sum()
loss += self.weight * squared_sum
```
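For reference, here is a minimal self-contained sketch of the setup I am describing. The module name `Adapted`, the layer sizes, the task loss, and the weight value are all placeholder assumptions, not my real code; the point is that `reg_matrix` is registered as a buffer (so it follows `.cuda()`/`.to()` but receives no gradient) and the penalty is added to the loss before `backward()`:

```python
import torch
import torch.nn as nn

class Adapted(nn.Module):
    # hypothetical stand-in for the pretrained model plus the new dense layer
    def __init__(self, in_dim=242, out_dim=42):
        super().__init__()
        self.adapt = nn.Linear(in_dim, out_dim, bias=False)
        # target the weight is pulled towards: [I | 0]; registered as a buffer
        # so it moves with the module but is never updated by the optimiser
        self.register_buffer(
            "reg_matrix",
            torch.cat((torch.eye(out_dim),
                       torch.zeros(out_dim, in_dim - out_dim)), dim=1),
        )

    def forward(self, x):
        return self.adapt(x)

model = Adapted()
reg_weight = 0.05                       # assumed regularisation weight

x = torch.randn(8, 242)
y = torch.randn(8, 42)
loss = nn.functional.mse_loss(model(x), y)   # hypothetical task loss
penalty = (model.adapt.weight - model.reg_matrix).pow(2).sum()
loss = loss + reg_weight * penalty
loss.backward()                         # gradient now includes the penalty term
```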
I originally assumed that autograd wasn't recording all operations properly, since I hadn't used PyTorch operations to calculate the L2 regularisation term, but even when I do use them it doesn't seem to work.
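One quick way to check this suspicion (a small sketch, with a stand-in weight rather than the real `adapt.weight`): a term that is in the autograd graph carries a `grad_fn`, and calling `backward()` on it populates the weight's `.grad`:

```python
import torch

w = torch.randn(3, 5, requires_grad=True)                # stands in for adapt.weight
target = torch.cat((torch.eye(3), torch.zeros(3, 2)), dim=1)

penalty = (w - target).pow(2).sum()
# if the penalty is being tracked by autograd, it has a grad_fn
assert penalty.grad_fn is not None

penalty.backward()
assert w.grad is not None                                # gradient reached the weight
```

If `penalty.grad_fn` were `None` (e.g. because the term was computed under `torch.no_grad()` or via NumPy), the penalty would have no effect on training, which would explain identical F1 scores for all weights.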
This is the function I'm adapting; the code above sits around line 42 here (the loss function itself is unchanged):
Results (F1, micro):

0.0  => 22.361%
0.01 => 13.309%
0.05 => 11.350%
1.0  => 12.084%
F1 scores from evaluation are around 42% with weight 0.0 and roughly 13% to 17% with regularisation. The original model without pretraining reaches 65%.
Any ideas would be appreciated.