I have a neural network that I pretrain on Dataset A and then finetune on Dataset B. Before finetuning I add a dense layer on top of the model (red arrow) that I would like to regularise. However, it does not seem to work properly: either the performance drops sharply even with tiny regularisation weights (in the 0.01–0.08 range, F1 drops from around 22% to 12% on the dev set), *or I get the exact same F1 score for every weight (as if I had trained the same model several times with no changes)* <-- the latter does not occur anymore.

(Since new users can only post one image for whatever reason, I have to resort to a screenshot here…)

Reg. layer initialisation:

`self.reg_matrix = torch.cat((torch.eye(42), torch.zeros(42, 200)), dim=1).cuda()`
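As an aside, one way to build that fixed target so it automatically follows the model across devices is to register it as a buffer instead of calling `.cuda()` directly. The `Adapter` wrapper below and the 242→42 layer shape are my assumptions (inferred from the 42×242 target matrix), so treat this as a sketch rather than the original code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Hypothetical wrapper around the dense 'adapt' layer, for illustration."""
    def __init__(self, in_features=242, out_features=42):
        super().__init__()
        self.adapt = nn.Linear(in_features, out_features, bias=False)
        # Target: identity over the first 42 input features, zeros for the
        # remaining 200, i.e. the layer is pulled toward passing those
        # 42 features straight through.
        target = torch.cat((torch.eye(42), torch.zeros(42, 200)), dim=1)
        # register_buffer stores a non-trainable tensor that moves with the
        # module on .cuda()/.to(device), so no manual device handling needed.
        self.register_buffer("reg_matrix", target)
```

With this, `Adapter().cuda()` places both `adapt.weight` and `reg_matrix` on the GPU together.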

Reg. term (adapt is the dense layer **W**):

`squared_sum = torch.sub(self.model.adapt.weight, self.reg_matrix).pow(2).sum()`

Loss:

`loss += self.weight * squared_sum`

I originally assumed that the autograd tape wasn’t recording all operations properly, since I hadn’t used PyTorch operations to compute the L2 regularisation term, but even when I do, it doesn’t seem to work.

This is the function I’m adapting, so the previous code is around line 42 here (loss function is unchanged):

Results (F1 (micro)):

```
0.0 => 22.361%
0.01 => 13.309%
0.05 => 11.350%
1.0 => 12.084%
```

F1 scores on the evaluation set are around 42% with weight 0.0 and roughly 13% to 17% with regularisation. The result for the original model without pretraining is 65%.

Any ideas would be appreciated.