You could implement L1 regularization in much the same way as the L2 regularization example.
For L1 regularization, you should change W.norm(2) to W.norm(p=1).
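A minimal sketch of how that could look inside a loss computation (the model, data, and l1_lambda value here are placeholders for illustration, not part of the original example):

import torch
import torch.nn as nn

# Placeholder model and data, just to show where the penalty is added.
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

criterion = nn.MSELoss()
l1_lambda = 1e-4  # assumed regularization strength

predictions = model(inputs)
loss = criterion(predictions, targets)

# L1 penalty on the weight matrix W, using norm(p=1) instead of norm(2).
W = model.weight
loss = loss + l1_lambda * W.norm(p=1)

loss.backward()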
Since the L1 regularizer is not differentiable everywhere, what does PyTorch do when it has to differentiate this function? A simple example shows that PyTorch returns a zero gradient at that point:
import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)  # x[2] is exactly 0.0
y = torch.abs(x)
y[2].backward()  # backprop through abs() at the non-differentiable point x = 0
print(x.grad)    # tensor([0., 0., 0., 0., 0.])
I think returning a zero gradient for a zero input is expected, and it fits the idea of a regularizer, since no penalty should be added to a weight that is already at zero. All other values will get valid gradients.
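For instance, backpropagating through every element (a small sketch along the same lines, not the original output) gives the sign of each input as its gradient:

import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
y = torch.abs(x)
y.sum().backward()
print(x.grad)  # tensor([-1., -1.,  0.,  1.,  1.])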
That makes sense. Does PyTorch use a specific algorithm to compute the gradient at this non-differentiable point? Is there an academic reference I can read that discusses this behaviour?