I wanted to do it manually so I implemented it as follows:

reg_lambda = 1.0
l2_reg = 0
for W in mdl.parameters():
    l2_reg += W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + reg_lambda*l2_reg
## BACKWARD PASS
batch_loss.backward() # Use autograd to compute the backward pass. Now W will have gradients

Is this correct? The key part I care about is that the SGD update works correctly, i.e.:

## SGD update
for W in mdl.parameters():
    delta = eta*W.grad.data
    W.data.copy_(W.data - delta) # W - eta*g + A*gdl_eps

l2_reg here is a Python scalar, so operations on it are not recorded for autograd's backward().
Instead, you should make l2_reg an autograd Variable:

l2_reg = None
for W in mdl.parameters():
    if l2_reg is None:
        l2_reg = W.norm(2)
    else:
        l2_reg = l2_reg + W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
batch_loss.backward()
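
For reference, here is a minimal sketch of the accumulate-then-backward pattern above combined with the manual update from the question, written for current PyTorch, where Variables have been merged into tensors. The model, the data shapes, and the values of `eta` and `reg_lambda` are made up for illustration and are not from the original posts.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: sizes and hyperparameters are placeholders.
N_train, D_in, D_out = 32, 4, 1
mdl = torch.nn.Linear(D_in, D_out)
batch_xs = torch.randn(N_train, D_in)
batch_ys = torch.randn(N_train, D_out)
eta = 0.01
reg_lambda = 1.0

# Snapshot of the parameters, only used to check that the update changed them.
before = [W.detach().clone() for W in mdl.parameters()]

# Forward pass and data loss.
y_pred = mdl(batch_xs)
data_loss = (1 / N_train) * (y_pred - batch_ys).pow(2).sum()

# Accumulate the penalty as a tensor so autograd tracks it.
l2_reg = torch.zeros(())
for W in mdl.parameters():
    l2_reg = l2_reg + W.norm(2)

batch_loss = data_loss + reg_lambda * l2_reg
batch_loss.backward()

# Manual SGD update; no_grad keeps the update itself out of the graph.
with torch.no_grad():
    for W in mdl.parameters():
        W -= eta * W.grad
        W.grad.zero_()  # clear gradients so they do not accumulate across steps
```

Zeroing the gradients after the update matters because backward() adds to W.grad rather than overwriting it.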

  File "…/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "…/lib/python3.6/site-packages/torch/autograd/_functions/reduce.py", line 200, in backward
    return input.mul(grad_output[0] / self.norm)
ZeroDivisionError: float division by zero

Any suggestions? (I guess it's somehow getting self.norm as 0.)

Since the parameters are Variables, won’t l2_reg be automatically converted to a Variable at the end? I’m using l2_reg=0 and it seems to work.
Also, I’m not sure OP’s formula for L2 regularization is correct. You need the sum of every parameter element squared, whereas W.norm(2) is the square root of that sum.
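
To make the point about the formula concrete, here is a small sketch with a hand-picked parameter (the values 3 and 4 are chosen only so the numbers are easy to check). It compares the sum of squared elements with what `W.norm(2)` returns, and also shows that the squared form has a well-defined gradient, 2*W, when a parameter is all zeros.

```python
import torch

# Hypothetical "parameter list" with easy numbers: ||(3, 4)|| = 5.
params = [torch.tensor([3.0, 4.0], requires_grad=True)]

# Sum of every parameter element squared (the standard L2 penalty): 9 + 16.
l2_squared = sum(W.pow(2).sum() for W in params)

# What a loop over W.norm(2) computes: a sum of unsquared norms.
l2_unsquared = sum(W.norm(2) for W in params)

# Squaring each norm recovers the standard penalty.
l2_from_norms = sum(W.norm(2) ** 2 for W in params)

print(l2_squared.item(), l2_unsquared.item())  # 25.0 5.0

# The squared penalty has gradient 2*W, which is simply zero at W = 0.
Z = torch.zeros(3, requires_grad=True)
Z.pow(2).sum().backward()
```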

Further, printing the value returned by the function, it clearly seems to be going towards INF. But almost the same logic for custom regularization works in TensorFlow. Am I doing something wrong code-wise?

Epoch: [0][0/391] Time 6.739 (6.739) Loss 3157501.5000 (3157501.5000) Prec@1 10.938 (10.938)
Variable containing:
2.7529e+08
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
4.1735e+12
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
5.0968e+27
[torch.cuda.FloatTensor of size 1 (GPU 0)]
…

I was looking for how to add the L2 norm of a parameter to the loss function, and I did as suggested above. However, I get this error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

How should I solve this problem? Thank you very much.

You are backpropagating through the same graph multiple times. Make sure that this is what you want, because in most cases you don’t need to. If it is, pass retain_graph=True to backward(). If not, find out where you are backpropagating more than once and fix it.
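
One common way to hit this error with a hand-written regularizer is computing the penalty once outside the training loop and reusing it in every iteration. That is only a guess here, since the failing code was not posted. A sketch with a made-up model and random data that rebuilds the penalty on every step, so each backward() runs on a fresh graph:

```python
import torch

torch.manual_seed(0)
mdl = torch.nn.Linear(4, 1)                      # hypothetical model
xs, ys = torch.randn(8, 4), torch.randn(8, 1)    # made-up data
reg_lambda, eta = 0.1, 0.01                      # placeholder hyperparameters

losses = []
for step in range(3):
    # Rebuild the penalty inside the loop: a new graph every iteration.
    l2_reg = sum(W.pow(2).sum() for W in mdl.parameters())
    loss = (mdl(xs) - ys).pow(2).mean() + reg_lambda * l2_reg
    loss.backward()  # no retain_graph needed
    with torch.no_grad():
        for W in mdl.parameters():
            W -= eta * W.grad
            W.grad.zero_()
    losses.append(loss.item())
```

If the penalty were computed once before the loop, the second call to backward() would try to reuse buffers that the first call already freed.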

Yes, it will penalize the bias terms. If you want it not to penalize them, you can easily filter them out by iterating over model.named_parameters() and not applying the regularizer to the parameters whose names contain "bias".
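
A minimal sketch of that filtering, assuming a made-up two-layer model. It relies on the default parameter names of the built-in layers containing "bias"; custom modules may name things differently, so check the names before relying on this.

```python
import torch

mdl = torch.nn.Sequential(  # hypothetical model, for illustration only
    torch.nn.Linear(4, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 1),
)

# Parameter names here are "0.weight", "0.bias", "2.weight", "2.bias".
penalized = [name for name, W in mdl.named_parameters() if "bias" not in name]

# Sum of squared entries over the weight tensors only.
l2_reg = sum(
    W.pow(2).sum()
    for name, W in mdl.named_parameters()
    if "bias" not in name
)
```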