I have a couple of questions about gradients, specifically about how gradients are applied in PyTorch.
I set up this example to understand gradient flow: how gradients are applied to the weights, and how they reach the inputs themselves.
I kept the chain rule in mind and wanted to recognize it in the gradient values.
In this experiment I simply map each input to itself through two objective functions, to see the effect of L1 and L2 loss on values above 1 (here, 10).
There are four linear layers, each with a single weight and no bias.
I set all the weights to 0.1, so the objective is for each weight to reach 1.
What I am seeing is that with the L2 loss the residual is much larger, which makes sense, and the gradient applied to the weights is also larger than in the other cases.
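The magnitudes can be checked against the chain rule directly. With output o = w·a and target a, L1 gives ∂L/∂o = sign(o − a), while L2 gives ∂L/∂o = 2(o − a), so the L2 gradient grows with the residual. A minimal standalone sketch with scalar tensors (not the `nn.Linear` setup below; the weight-gradient magnitudes here are for a single scalar weight) reproduces the input-gradient values reported further down:

```python
import torch

# L1: loss = |w*a - a|, with w = 0.1 and a = 10
w = torch.tensor(0.1, requires_grad=True)
a = torch.tensor(10.0, requires_grad=True)
(w * a - a).abs().backward()
# chain rule: dL/dw = sign(w*a - a) * a       = -1 * 10     = -10
#             dL/da = sign(w*a - a) * (w - 1) = -1 * (-0.9) = 0.9
print(w.grad, a.grad)

# L2 (MSE): loss = (w*a - a)**2
w2 = torch.tensor(0.1, requires_grad=True)
b = torch.tensor(10.0, requires_grad=True)
((w2 * b - b) ** 2).backward()
# dL/dw = 2*(w*a - a) * a       = 2 * -9 * 10   = -180
# dL/da = 2*(w*a - a) * (w - 1) = 2 * -9 * -0.9 = 16.2
print(w2.grad, b.grad)
```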
My questions:
We get the same input gradients (in the L1 cases), but for the weights we see a difference:
```
input gradients: tensor([0.9000]) tensor([0.9000]) tensor([1.6200]) tensor([16.2000])
weight gradients: tensor([[-0.5000], [-0.5000]]) tensor([[-5.], [-5.]]) tensor([[-0.9000], [-0.9000]]) tensor([[-90.], [-90.]])
```
In the L1 cases the weights are equal, and since L1 has a constant slope I expected the gradients coming out of the multiplication to be the same. The input gradients are indeed the same, but the weight gradients differ by a factor of 10. Why?
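To frame the question in chain-rule terms: ∂L1/∂w = sign(w·a − a) · a, which scales with the input a, while ∂L1/∂a = sign(w·a − a) · (w − 1), which does not. A small self-contained check with single 1→1 layers (the magnitudes here are for one scalar weight, so the 10× ratio is the point, not the absolute values):

```python
import torch
import torch.nn as nn

def l1_grads(x):
    """Input and weight gradient of the L1 loss |w*x - x| with w = 0.1."""
    layer = nn.Linear(1, 1, bias=False)
    with torch.no_grad():
        layer.weight.fill_(0.1)
    a = torch.tensor([x], requires_grad=True)
    nn.L1Loss()(layer(a), a).backward()
    return a.grad.item(), layer.weight.grad.item()

print(l1_grads(1.0))   # input grad 0.9, weight grad -1
print(l1_grads(10.0))  # input grad 0.9 again, weight grad -10 (scales with the input)
```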
```python
import torch
import torch.nn as nn
import random
import numpy

def set_seed(seed):
    # some of these are not necessary for this script
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    numpy.random.seed(seed)
    random.seed(seed)

set_seed(0)
torch.set_printoptions(precision=10)

# four independent 1 -> 1 layers, no bias, all weights initialized to 0.1
l21 = nn.Linear(1, 1, bias=False)
l22 = nn.Linear(1, 1, bias=False)
l23 = nn.Linear(1, 1, bias=False)
l24 = nn.Linear(1, 1, bias=False)
for layer in (l21, l22, l23, l24):
    with torch.no_grad():
        layer.weight.fill_(0.1)

optimizer = torch.optim.SGD(
    list(l21.parameters()) + list(l22.parameters())
    + list(l23.parameters()) + list(l24.parameters()),
    lr=0.00001, momentum=0)

n1 = nn.L1Loss()
n2 = nn.MSELoss()

# each input doubles as its own target, so input gradients can be inspected
a1 = torch.tensor([1.], requires_grad=True)
a2 = torch.tensor([10.], requires_grad=True)
a3 = torch.tensor([1.], requires_grad=True)
a4 = torch.tensor([10.], requires_grad=True)

while True:
    ol1_1 = l21(a1)
    ol1_10 = l22(a2)
    ol2_1 = l23(a3)
    ol2_10 = l24(a4)

    ls11 = n1(ol1_1, a1)   # L1,  input 1
    ls12 = n1(ol1_10, a2)  # L1,  input 10
    ls21 = n2(ol2_1, a3)   # MSE, input 1
    ls22 = n2(ol2_10, a4)  # MSE, input 10

    optimizer.zero_grad()
    if a1.grad is not None:
        a1.grad.zero_()
        a2.grad.zero_()
        a3.grad.zero_()
        a4.grad.zero_()

    ls11.backward()
    ls12.backward()
    ls21.backward()
    ls22.backward()

    print('a1, a2, a3, a4')
    print(a1, a2, a3, a4)
    print('input gradients:', a1.grad, a2.grad, a3.grad, a4.grad)
    print('weight gradients:', l21.weight.grad, l22.weight.grad,
          l23.weight.grad, l24.weight.grad)
    optimizer.step()
    print('loss values:', ls11, ls12, ls21, ls22)
    print('weight values:', l21.weight.data, l22.weight.data,
          l23.weight.data, l24.weight.data)
    print('***********')
```