How to add gradient L1 regularization to loss?

I need to add an L1 penalty on the gradients to my loss. I find that only the implementation below that calls backward twice produces an update that differs from the unregularized result. If I use torch.autograd.grad in place of the first backward, the updated weights are identical to the unregularized ones, and I don't know why that happens. I'm also not sure whether the two-backward implementation actually does what I hope.

Toy code:

import torch

torch.manual_seed(0)
x = torch.randn(3, 4)
fc1 = torch.nn.Linear(4, 3)
fc2 = torch.nn.Linear(4, 3)
fc3 = torch.nn.Linear(4, 3)
# start all three layers from identical weights so the updates are comparable
fc2.load_state_dict(fc1.state_dict())
fc3.load_state_dict(fc1.state_dict())
opt1 = torch.optim.SGD(fc1.parameters(), lr=0.01)
opt2 = torch.optim.SGD(fc2.parameters(), lr=0.01)
opt3 = torch.optim.SGD(fc3.parameters(), lr=0.01)

# without gradient regularization
y1 = fc1(x)
loss1 = (1 - y1).sum()
loss1.backward()
opt1.step()

# with gradient regularization, first implementation: backward twice
y2 = fc2(x)
l2 = (1 - y2).sum()
l2.backward(retain_graph=True)  # populates p.grad; these tensors are detached from the graph
np2 = sum(p.grad.norm(p=1) for p in fc2.parameters())
loss2 = l2 + np2
loss2.backward()  # np2 carries no graph, so this only accumulates l2's gradient a second time
opt2.step()

# with gradient regularization, second implementation: torch.autograd.grad
y3 = fc3(x)
l3 = (1 - y3).sum()
grads = torch.autograd.grad(l3, fc3.parameters(), create_graph=True)
np3 = sum(g.norm(p=1) for g in grads)
loss3 = l3 + np3
loss3.backward()
opt3.step()

# print weights after the update
print(fc2.weight)  # different from fc1.weight
print(fc3.weight)  # same as fc1.weight
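One way to see why the penalty in the first implementation cannot influence the update: the .grad tensors that backward() populates are plain leaves detached from the autograd graph, so a norm built from them has no path back to the parameters. A minimal check (separate from the toy above):

```python
import torch

fc = torch.nn.Linear(4, 3)
fc(torch.randn(3, 4)).sum().backward()

# .grad tensors carry no autograd history
for p in fc.parameters():
    print(p.grad.requires_grad, p.grad.grad_fn)  # False None

# so a penalty built from them does not require grad either
penalty = sum(p.grad.norm(p=1) for p in fc.parameters())
print(penalty.requires_grad)  # False
```

Adding such a penalty to the loss changes the loss value but contributes nothing to the gradients computed by the next backward.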

I think I see where I went wrong. The second derivative of a linear function is 0, so with torch.autograd.grad the penalty adds nothing to the update, which makes it match the unregularized result; with a nonlinear model the results do differ.
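That claim can be checked directly: for a model with a nonlinearity (tanh here, chosen just for illustration), the gradient of the penalty with respect to the parameters is no longer identically zero, so the torch.autograd.grad version does change the update:

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 4)
net = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.Tanh(),
    torch.nn.Linear(3, 1),
)
loss = (1 - net(x)).sum()

grads = torch.autograd.grad(loss, net.parameters(), create_graph=True)
penalty = sum(g.norm(p=1) for g in grads)

# differentiate the penalty itself; allow_unused because some gradients
# (e.g. for the last bias) are constants that depend on no parameter
pgrads = torch.autograd.grad(penalty, net.parameters(), allow_unused=True)
print(any(g is not None and g.abs().sum() > 0 for g in pgrads))  # True
```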
So it looks like the torch.autograd.grad implementation is the correct one? If anyone can help confirm, it would be greatly appreciated!