How to add gradient L1 regularization to loss?

I need to add an L1 penalty on the gradients to my loss. I find that only the implementation below that calls backward twice produces an update that differs from the unregularized result. If I use torch.autograd.grad in place of the first backward, the updated weights are identical to the unregularized ones, and I don't know why that happens. I'm also not sure whether the two-backward implementation actually does what I hope.

Toy code:

import torch

torch.manual_seed(0)
x = torch.randn(3, 4)
fc1 = torch.nn.Linear(4, 3)
fc2 = torch.nn.Linear(4, 3)
fc3 = torch.nn.Linear(4, 3)
# start all three layers from identical weights so the updates are comparable
fc2.load_state_dict(fc1.state_dict())
fc3.load_state_dict(fc1.state_dict())
opt1 = torch.optim.SGD(fc1.parameters(), lr=0.01)
opt2 = torch.optim.SGD(fc2.parameters(), lr=0.01)
opt3 = torch.optim.SGD(fc3.parameters(), lr=0.01)

# without gradient regularization
y1 = fc1(x)
loss1 = (1 - y1).sum()
loss1.backward()
opt1.step()

# with gradient regularization, first implementation: backward twice
y2 = fc2(x)
l2 = (1 - y2).sum()
l2.backward(retain_graph=True)  # populates p.grad; these tensors are detached from the graph
np2 = sum(p.grad.norm(p=1) for p in fc2.parameters())
loss2 = l2 + np2
loss2.backward()  # np2 carries no graph, so this only accumulates l2's gradient a second time
opt2.step()

# with gradient regularization, second implementation: torch.autograd.grad
y3 = fc3(x)
l3 = (1 - y3).sum()
grads = torch.autograd.grad(l3, fc3.parameters(), create_graph=True)
np3 = sum(g.norm(p=1) for g in grads)
loss3 = l3 + np3
loss3.backward()
opt3.step()

# print weights after the update
print(fc2.weight)  # different from fc1.weight
print(fc3.weight)  # same as fc1.weight
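One way to see why the penalty in the first implementation cannot influence the update: the .grad tensors that backward() populates are plain leaves detached from the autograd graph, so a norm built from them has no path back to the parameters. A minimal check (separate from the toy above):

```python
import torch

fc = torch.nn.Linear(4, 3)
fc(torch.randn(3, 4)).sum().backward()

# .grad tensors carry no autograd history
for p in fc.parameters():
    print(p.grad.requires_grad, p.grad.grad_fn)  # False None

# so a penalty built from them does not require grad either
penalty = sum(p.grad.norm(p=1) for p in fc.parameters())
print(penalty.requires_grad)  # False
```

Adding such a penalty to the loss changes the loss value but contributes nothing to the gradients computed by the next backward.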

I think I see where I went wrong. The second derivative of a linear function is 0, so with torch.autograd.grad the penalty adds nothing to the update, which makes it match the unregularized result; with a nonlinear model the results do differ.
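That claim can be checked directly: for a model with a nonlinearity (tanh here, chosen just for illustration), the gradient of the penalty with respect to the parameters is no longer identically zero, so the torch.autograd.grad version does change the update:

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 4)
net = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.Tanh(),
    torch.nn.Linear(3, 1),
)
loss = (1 - net(x)).sum()

grads = torch.autograd.grad(loss, net.parameters(), create_graph=True)
penalty = sum(g.norm(p=1) for g in grads)

# differentiate the penalty itself; allow_unused because some gradients
# (e.g. for the last bias) are constants that depend on no parameter
pgrads = torch.autograd.grad(penalty, net.parameters(), allow_unused=True)
print(any(g is not None and g.abs().sum() > 0 for g in pgrads))  # True
```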
So it looks like the torch.autograd.grad implementation is the correct one? If anyone can help confirm, it would be greatly appreciated!