How gradients are applied in PyTorch

I have a couple of questions regarding gradients, specifically how gradients are applied in PyTorch.
I set up this example to understand gradient flow: how gradients are applied to the weights and how they propagate back to the inputs themselves.
I kept the chain rule in mind and wanted to recognize it in the gradient values.
In this experiment, I simply want to map the input back onto itself via two objective functions, and I wanted to see the effect of L1 and L2 losses on values above 1 (here, 10).
Each case uses a single linear layer with one weight and no bias.
I set all the weights to 0.1, so the objective is for each weight to reach 1.

What I see is that with the L2 loss the difference is huge, which makes sense, and the gradient applied to the weights is also larger than in the other cases.

My questions:
We have the same inputs, but in the weight gradients we see a difference:

input gradients: tensor([0.9000]) tensor([0.9000]) tensor([1.6200]) tensor([16.2000])
weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad
weight gradients: tensor([[-0.5000],
        [-0.5000]]) tensor([[-5.],
        [-5.]]) tensor([[-0.9000],
        [-0.9000]]) tensor([[-90.],
        [-90.]])

In the L1 case the weights are the same and L1 has a constant slope, so I expected the gradients from the multiplication to be the same in both cases. The input gradients are indeed the same, but the weight gradients differ by a factor of 10. Why is that?
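To make my chain-rule expectation concrete, here is a minimal sketch (my own simplification, not the full script below: one scalar weight and a detached target, so the only gradient path is through the prediction w * x):

import torch
import torch.nn.functional as F

# Hand-rolled chain-rule check for the L1 cases (simplified setup, not the
# original four-layer script): a single scalar weight w = 0.1, and the target
# is detached so the only gradient path is the prediction w * x.
w = torch.tensor(0.1, requires_grad=True)

for x_val in (1.0, 10.0):
    x = torch.tensor(x_val, requires_grad=True)
    loss = F.l1_loss(w * x, x.detach())
    loss.backward()
    # dL/dw = sign(w*x - x) * x  -> scales with the input: -1.0 vs -10.0
    # dL/dx = sign(w*x - x) * w  -> identical in both cases: -0.1
    print(f"x={x_val}: dL/dw={w.grad.item():.4f}, dL/dx={x.grad.item():.4f}")
    w.grad = None

Written out, dL/dw = sign(w*x - x) * x while dL/dx = sign(w*x - x) * w through the prediction, which would explain a weight gradient that scales with the input. In the full script below the input gradient additionally picks up a term through the target, because the target is the input tensor itself, which is where the 0.9 comes from.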


import torch
import torch.nn as nn
import random, numpy
# some are not necessary
def set_seed(seed):
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    numpy.random.seed(seed)
    random.seed(seed)

set_seed(0)
torch.set_printoptions(precision=10)

l21=nn.Linear(1,1,bias=False)
l22=nn.Linear(1,1,bias=False)
l23=nn.Linear(1,1,bias=False)
l24=nn.Linear(1,1,bias=False)



optimizer = torch.optim.SGD(list(l21.parameters())+list(l22.parameters())+list(l23.parameters())+list(l24.parameters()), lr=0.00001, momentum=0)
l21.weight.data=l22.weight.data=l23.weight.data=l24.weight.data=torch.Tensor([[0.1],[0.1]])
n1=nn.L1Loss()
n2=nn.MSELoss()
a1=torch.Tensor([1])
a2=torch.Tensor([10])
a3=torch.Tensor([1])
a4=torch.Tensor([10])

a1.requires_grad = a2.requires_grad = a3.requires_grad = a4.requires_grad = True


while True:
    ol1_1=l21(a1)
    ol1_10 = l22(a2)
    ol2_1 = l23(a3)
    ol2_10 = l24(a4)

    ls11=n1(ol1_1,a1)
    ls12 = n1(ol1_10, a2)
    ls21 = n2(ol2_1, a3)
    ls22 = n2(ol2_10, a4)

    optimizer.zero_grad()
    if a1.grad is not None:
        a1.grad.zero_()
        a2.grad.zero_()
        a3.grad.zero_()
        a4.grad.zero_()
    ls11.backward()
    ls12.backward()
    ls21.backward()
    ls22.backward()

    print('a1, a2, a3, a4')
    print(a1,a2,a3,a4)

    print('input gradients:,a1.grad,a2.grad,a3.grad,a4.grad')
    print('input gradients:',a1.grad,a2.grad,a3.grad,a4.grad)

    print('weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad')
    print('weight gradients:',l21.weight.grad,l22.weight.grad,l23.weight.grad,l24.weight.grad)

    optimizer.step()
    print('loss values:,ls11,ls12,ls21,ls22')
    print('loss values:',ls11,ls12,ls21,ls22)

    print('weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data')
    print('weight values:',l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data)
    print('***********')

Are you intentionally setting all of the layers to use the same underlying weight tensor? I believe that is what is happening with the l21.weight.data = ... = ... = ... assignment, which might be giving you the unexpected results.

In other words, when this weight tensor is used/updated, this change is shared across all the “different” layer definitions.

>>> import torch
>>> a = torch.nn.Linear(1,1,bias=False)
>>> b = torch.nn.Linear(1,1,bias=False)
>>> c = torch.nn.Linear(1,1,bias=False)
>>> a.weight.data = b.weight.data = c.weight.data = torch.tensor([[0.1],[0.1]])
>>> a.weight.data += 1
>>> a.weight.data
tensor([[1.1000],
        [1.1000]])
>>> b.weight.data
tensor([[1.1000],
        [1.1000]])
>>> c.weight.data
tensor([[1.1000],
        [1.1000]])
>>>

You might want to separately assign new torch.Tensor(...)s to the weights if you wish to avoid this.
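For instance, something like this (a sketch in the same .data style used in this thread; copying into each weight under torch.no_grad() would be the more idiomatic route):

import torch
import torch.nn as nn

l21 = nn.Linear(1, 1, bias=False)
l22 = nn.Linear(1, 1, bias=False)

# each layer now owns its own (1, 1) weight tensor, so updates are not shared
l21.weight.data = torch.Tensor([[0.1]])
l22.weight.data = torch.Tensor([[0.1]])

l21.weight.data += 1
print(l21.weight.data)  # tensor([[1.1000]])
print(l22.weight.data)  # tensor([[0.1000]])  <- unaffected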


Thanks. I didn’t notice that, but the problem is still there:

a1, a2, a3, a4
tensor([1.], requires_grad=True) tensor([10.], requires_grad=True) tensor([1.], requires_grad=True) tensor([10.], requires_grad=True)
input gradients:,a1.grad,a2.grad,a3.grad,a4.grad
input gradients: tensor([0.8999999762]) tensor([0.8999999762]) tensor([1.6200000048]) tensor([16.2000007629])
weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad
weight gradients: tensor([[-0.5000000000],
        [-0.5000000000]]) tensor([[-5.],
        [-5.]]) tensor([[-0.8999999762],
        [-0.8999999762]]) tensor([[-90.],
        [-90.]])
loss values:,ls11,ls12,ls21,ls22
loss values: tensor(0.8999999762, grad_fn=<L1LossBackward>) tensor(9., grad_fn=<L1LossBackward>) tensor(0.8099999428, grad_fn=<MseLossBackward>) tensor(81., grad_fn=<MseLossBackward>)
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.1000050008],
        [0.1000050008]]) tensor([[0.1000500023],
        [0.1000500023]]) tensor([[0.1000090018],
        [0.1000090018]]) tensor([[0.1009000018],
        [0.1009000018]])

This appears to be because the learning rate is set to a very small value (0.00001) here.
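Plain SGD (no momentum, no weight decay) updates each weight as w <- w - lr * grad, so the weights do move, just very slowly; a quick check against the l21 numbers printed above:

# SGD update with no momentum / weight decay: w <- w - lr * grad
lr, grad, w = 0.00001, -0.5, 0.1   # l21's values from the printout above
print(w - lr * grad)               # 0.100005, i.e. the 0.1000050008 shown above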

With the following modification:

optimizer = torch.optim.SGD(list(l21.parameters())+list(l22.parameters())+list(l23.parameters())+list(l24.parameters()), lr=0.1, momentum=0)
l21.weight.data = torch.Tensor([[0.1],[0.1]])
l22.weight.data = torch.Tensor([[0.1],[0.1]])
l23.weight.data = torch.Tensor([[0.1],[0.1]])
l24.weight.data = torch.Tensor([[0.1],[0.1]])

I see

tensor([1.], requires_grad=True) tensor([10.], requires_grad=True) tensor([1.], requires_grad=True) tensor([10.], requires_grad=True)
input gradients:,a1.grad,a2.grad,a3.grad,a4.grad
input gradients: tensor([0.8999999762]) tensor([0.8999999762]) tensor([1.6200000048]) tensor([16.2000007629])
weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad
weight gradients: tensor([[-0.5000000000],
        [-0.5000000000]]) tensor([[-5.],
        [-5.]]) tensor([[-0.8999999762],
        [-0.8999999762]]) tensor([[-90.],
        [-90.]])
loss values:,ls11,ls12,ls21,ls22
loss values: tensor(0.8999999762, grad_fn=<L1LossBackward0>) tensor(9., grad_fn=<L1LossBackward0>) tensor(0.8099999428, grad_fn=<MseLossBackward0>) tensor(81., grad_fn=<MseLossBackward0>)
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.1500000060],
        [0.1500000060]]) tensor([[0.6000000238],
        [0.6000000238]]) tensor([[0.1899999976],
        [0.1899999976]]) tensor([[9.1000003815],
        [9.1000003815]])
***********
a1, a2, a3, a4
tensor([1.], requires_grad=True) tensor([10.], requires_grad=True) tensor([1.], requires_grad=True) tensor([10.], requires_grad=True)
input gradients:,a1.grad,a2.grad,a3.grad,a4.grad
input gradients: tensor([0.8500000238]) tensor([0.3999999762]) tensor([1.3122000694]) tensor([1312.2000732422])
weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad
weight gradients: tensor([[-0.5000000000],
        [-0.5000000000]]) tensor([[-5.],
        [-5.]]) tensor([[-0.8100000024],
        [-0.8100000024]]) tensor([[810.],
        [810.]])
loss values:,ls11,ls12,ls21,ls22
loss values: tensor(0.8500000238, grad_fn=<L1LossBackward0>) tensor(4., grad_fn=<L1LossBackward0>) tensor(0.6560999751, grad_fn=<MseLossBackward0>) tensor(6561., grad_fn=<MseLossBackward0>)
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.2000000030],
        [0.2000000030]]) tensor([[1.1000000238],
        [1.1000000238]]) tensor([[0.2709999979],
        [0.2709999979]]) tensor([[-71.9000015259],
        [-71.9000015259]])
***********
a1, a2, a3, a4
tensor([1.], requires_grad=True) tensor([10.], requires_grad=True) tensor([1.], requires_grad=True) tensor([10.], requires_grad=True)
input gradients:,a1.grad,a2.grad,a3.grad,a4.grad
input gradients: tensor([0.8000000119]) tensor([0.1000000238]) tensor([1.0628819466]) tensor([106288.2031250000])
weight gradients:, l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad
weight gradients: tensor([[-0.5000000000],
        [-0.5000000000]]) tensor([[5.],
        [5.]]) tensor([[-0.7289999723],
        [-0.7289999723]]) tensor([[-7290.],
        [-7290.]])
loss values:,ls11,ls12,ls21,ls22
loss values: tensor(0.8000000119, grad_fn=<L1LossBackward0>) tensor(1., grad_fn=<L1LossBackward0>) tensor(0.5314409733, grad_fn=<MseLossBackward0>) tensor(531441., grad_fn=<MseLossBackward0>)
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.2500000000],
        [0.2500000000]]) tensor([[0.6000000238],
        [0.6000000238]]) tensor([[0.3438999951],
        [0.3438999951]]) tensor([[657.1000366211],
        [657.1000366211]])
***********