l2 = 0.5 * params.LC * sum(lasagne.regularization.l2(x) for x in self.network_params)
if params.updatewords:
    return l2 + 0.5 * params.LW * lasagne.regularization.l2(We - initial_We)
else:
    return l2
In the paper, the authors say: "All models use L2 regularization on all parameters, except for the word embeddings, which are regularized back to their initial values with an L2 penalty".
But I don't know how to "regularize back to their initial values" in PyTorch.
I have tried the Theano/Lasagne snippet above, but it did not work as expected.
You can use the optimizer's weight_decay option for L2 regularization, but it won't pull the weights toward their initial values; it only decays them toward zero at each step.
You'll have to implement something like the Theano snippet yourself, right after the optim.step() call (or as an extra term in the loss before backward()).
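A minimal sketch of that idea in PyTorch: snapshot the embedding weights once at the start, then add an L2 penalty on the difference to the task loss before calling backward(). The layer sizes, the LW coefficient, and the helper name embedding_penalty are illustrative, not from the paper's code.

```python
import torch

# toy embedding layer; dimensions are arbitrary for illustration
emb = torch.nn.Embedding(100, 16)

# frozen snapshot of the initial values (detached so it never gets gradients)
initial_weight = emb.weight.detach().clone()

def embedding_penalty(lw=0.5):
    # 0.5 * LW * ||W - W_0||^2, mirroring the Lasagne snippet above
    return 0.5 * lw * (emb.weight - initial_weight).pow(2).sum()

# stand-in for the real task loss
loss = emb(torch.tensor([1, 2, 3])).sum()
total = loss + embedding_penalty()
total.backward()
```

Because the penalty is built from emb.weight (a live Parameter), its gradient flows into the embedding alongside the task loss, pulling the weights back toward initial_weight rather than toward zero.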
Here is how I implement my custom L2. Can anybody verify whether it is correct?
Here is my getParameters function, which takes all parameters of the sub-modules and flattens them into one tensor so I can compute the norm easily:
def getParameters(self):
    """
    Get flattened parameters.

    Note that getParameters and parameters() are not equal in this case:
    getParameters does not include the parameters of the output module.
    :return: 1d tensor
    """
    params = []
    for m in [self.ix, self.ih, self.fx, self.fh, self.ox, self.oh, self.ux, self.uh]:
        # we do not include the parameters of the output module
        params.extend(list(m.parameters()))
    one_dim = [p.view(-1) for p in params]
    return torch.cat(one_dim)
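Assuming getParameters returns a flat 1-d tensor built with torch.cat (which preserves the autograd graph), the L2 term can then be computed as a differentiable node in one line. A self-contained toy version, with a single Linear layer standing in for the sub-modules:

```python
import torch

# toy module standing in for one of the sub-modules (ix, ih, ...)
lin = torch.nn.Linear(4, 4)

# flatten all parameters into one 1-d tensor, as getParameters does
flat = torch.cat([p.view(-1) for p in lin.parameters()])

# the squared L2 norm stays differentiable because torch.cat keeps the graph
l2 = 0.5 * flat.pow(2).sum()
```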
I add my custom L2 term to err before calling backward(), then step().
Only err is a Variable (it is the output of criterion(output, target)).
But l2_model, l2_emb_params, and batch_size are not Variables (they are plain floats and ints).
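That is likely the bug: if the L2 term is a plain Python float, adding it to err shifts the loss value but contributes nothing to the gradients, because autograd only tracks tensor operations. A minimal sketch of the difference (the variable names here are illustrative):

```python
import torch

w = torch.nn.Parameter(torch.ones(3))
err = (w * 2).sum()                      # stand-in for criterion(output, target)

# WRONG: converting to a float detaches the penalty from the graph,
# so backward() never sees it
l2_float = 0.5 * w.pow(2).sum().item()

# RIGHT: keep the penalty as a tensor so gradients flow through it
l2_tensor = 0.5 * w.pow(2).sum()

total = err + l2_tensor
total.backward()
# w.grad is d(err)/dw + w: the extra w comes from the L2 term
```

So the fix is to compute l2_model and l2_emb_params as tensors built from the live parameters (as in the flattening snippet above), not as pre-computed floats.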