# How gradients are applied in PyTorch

I have a couple of questions about gradients, specifically about how they are applied in PyTorch.
I set up this example to understand how gradients flow, how they are applied to the weights, and how they reach the inputs themselves.
I kept the chain rule in mind and wanted to find it in the gradient values.
In this experiment the target is the input itself. I compare two objective functions, L1 and L2 (MSE), and I wanted to see their effect on inputs larger than 1 (here, 10).
Each case uses a single linear layer with one weight and no bias.
I set all the weights to 0.1; since the target equals the input, the ideal weight is 1.

What I see is that with the L2 loss the difference is huge, which makes sense, and the gradient applied to the weights is also larger than in the other cases.

My questions:
The input gradients are the same, but the weight gradients differ:

``````
input gradients: tensor([0.9000]) tensor([0.9000]) tensor([1.6200]) tensor([16.2000])
[-0.5000]]) tensor([[-5.],
[-5.]]) tensor([[-0.9000],
[-0.9000]]) tensor([[-90.],
[-90.]])
``````

In the L1 case the weights are equal, so I expected the gradients to be equal as well, since L1 has a constant slope. The input gradients are indeed the same, but the weight gradients differ by a factor of 10.
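
One way to see where the factor of 10 comes from is to write the chain rule out for a single weight. The sketch below is not the code from this thread: it assumes one scalar weight `w` per case, no bias, and the input `x` doubling as the target, and the helper name `grads` is made up for illustration. For `y = w * x`, the weight gradient is `dL/dy * x`, so it scales with the input even when the L1 slope `dL/dy` is constant, while the L1 input gradient works out to `|w - 1|` for any positive `x`.

```python
import torch

l1 = torch.nn.L1Loss()
mse = torch.nn.MSELoss()


def grads(x_value, loss_fn, w_value=0.1):
    # Hypothetical helper: one scalar weight, no bias, target equal to the input.
    x = torch.tensor([x_value], requires_grad=True)
    w = torch.tensor([w_value], requires_grad=True)
    loss = loss_fn(w * x, x)  # x is both the input and the target
    loss.backward()
    # Returns (dL/dw, dL/dx)
    return w.grad.item(), x.grad.item()


for x_value in (1.0, 10.0):
    print("x =", x_value, "L1:", grads(x_value, l1), "MSE:", grads(x_value, mse))
```

With the two-row weights shown in the printouts, the loss averages over two outputs, so each row's weight gradient would be half of what this sketch reports (for example -0.5 and -5 for L1).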

``````
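import random

import numpy
import torch
import torch.nn as nn


# Some of these seeding calls are not strictly necessary.
def set_seed(seed):
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    numpy.random.seed(seed)
    random.seed(seed)


set_seed(0)
torch.set_printoptions(precision=10)

# Four linear layers without bias: L1 loss with inputs 1 and 10, L2 loss with inputs 1 and 10.
l21 = nn.Linear(1, 1, bias=False)
l22 = nn.Linear(1, 1, bias=False)
l23 = nn.Linear(1, 1, bias=False)
l24 = nn.Linear(1, 1, bias=False)

optimizer = torch.optim.SGD(
    list(l21.parameters()) + list(l22.parameters())
    + list(l23.parameters()) + list(l24.parameters()),
    lr=0.00001, momentum=0)

# Start every weight at 0.1; the target equals the input, so the ideal weight is 1.
l21.weight.data = torch.Tensor([[0.1]])
l22.weight.data = torch.Tensor([[0.1]])
l23.weight.data = torch.Tensor([[0.1]])
l24.weight.data = torch.Tensor([[0.1]])
n1 = nn.L1Loss()
n2 = nn.MSELoss()
a1 = torch.Tensor([1])
a2 = torch.Tensor([10])
a3 = torch.Tensor([1])
a4 = torch.Tensor([10])

# Track gradients with respect to the inputs as well.
for a in (a1, a2, a3, a4):
    a.requires_grad = True

while True:
    # Clear gradients from the previous iteration so they do not accumulate.
    optimizer.zero_grad()
    for a in (a1, a2, a3, a4):
        a.grad = None

    # Forward pass: each layer sees its own input.
    ol1_1 = l21(a1)
    ol1_10 = l22(a2)
    ol2_1 = l23(a3)
    ol2_10 = l24(a4)

    # The target of every loss is the input itself.
    ls11 = n1(ol1_1, a1)
    ls12 = n1(ol1_10, a2)
    ls21 = n2(ol2_1, a3)
    ls22 = n2(ol2_10, a4)

    ls11.backward()
    ls12.backward()
    ls21.backward()
    ls22.backward()

    print('a1, a2, a3, a4')
    print(a1, a2, a3, a4)
    # Gradients with respect to the inputs and the weights.
    print('input gradients:', a1.grad, a2.grad, a3.grad, a4.grad)
    print('weight gradients:', l21.weight.grad, l22.weight.grad, l23.weight.grad, l24.weight.grad)

    optimizer.step()
    print('loss values:,ls11,ls12,ls21,ls22')
    print('loss values:', ls11, ls12, ls21, ls22)

    print('weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data')
    print('weight values:', l21.weight.data, l22.weight.data, l23.weight.data, l24.weight.data)
    print('***********')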
``````

Are you intentionally setting all of the layers to use the same underlying weight tensor? I believe that is what is happening with the `l21.weight.data = ... = ... = ...` assignment, which might be giving you the unexpected results.

In other words, when this weight tensor is used or updated, the change is shared across all the "different" layer definitions.

``````
>>> import torch
>>> a = torch.nn.Linear(1,1,bias=False)
>>> b = torch.nn.Linear(1,1,bias=False)
>>> c = torch.nn.Linear(1,1,bias=False)
>>> a.weight.data = b.weight.data = c.weight.data = torch.tensor([[0.1],[0.1]])
>>> a.weight.data += 1
>>> a.weight.data
tensor([[1.1000],
        [1.1000]])
>>> b.weight.data
tensor([[1.1000],
        [1.1000]])
>>> c.weight.data
tensor([[1.1000],
        [1.1000]])
``````

You might want to separately assign new `torch.Tensor(...)`s to the weights if you wish to avoid this.
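
As a minimal sketch of that suggestion (not the original poster's code), the weights can be set per layer so that each layer keeps its own storage. Using `nn.init.constant_` here is just one option, and the list name `layers` is an assumption made for illustration; assigning a fresh `torch.Tensor(...)` to each layer separately should work as well.

```python
import torch
import torch.nn as nn

# Three independent single-weight layers, as in the REPL session above.
layers = [nn.Linear(1, 1, bias=False) for _ in range(3)]

# Fill each layer's own weight in place instead of sharing one tensor.
for layer in layers:
    nn.init.constant_(layer.weight, 0.1)

# Changing the first layer no longer touches the others.
layers[0].weight.data += 1
print([layer.weight.item() for layer in layers])

# Each weight lives in a different memory location.
ptrs = {layer.weight.data_ptr() for layer in layers}
print(len(ptrs))
```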


Thanks. I didn't notice that, but the problem is still there:

``````
a1, a2, a3, a4
input gradients: tensor([0.8999999762]) tensor([0.8999999762]) tensor([1.6200000048]) tensor([16.2000007629])
[-0.5000000000]]) tensor([[-5.],
[-5.]]) tensor([[-0.8999999762],
[-0.8999999762]]) tensor([[-90.],
[-90.]])
loss values:,ls11,ls12,ls21,ls22
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.1000050008],
[0.1000050008]]) tensor([[0.1000500023],
[0.1000500023]]) tensor([[0.1000090018],
[0.1000090018]]) tensor([[0.1009000018],
[0.1009000018]])
``````

This appears to be because the learning rate is set to a small value `0.00001` here.
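
To make the effect of the learning rate concrete, here is a small sketch of the plain SGD update (no momentum), `w <- w - lr * grad`, applied by hand to the weight gradients printed in this thread. The function name `sgd_step` is made up for illustration, and the starting weight of 0.1 is taken from the original post.

```python
def sgd_step(weight, grad, lr):
    # Plain SGD without momentum: w <- w - lr * grad
    return weight - lr * grad


# Weight gradients as printed in the thread (L1 with inputs 1 and 10, L2 with inputs 1 and 10).
weight_grads = [-0.5, -5.0, -0.9, -90.0]

for lr in (0.00001, 0.1):
    updated = [round(sgd_step(0.1, g, lr), 7) for g in weight_grads]
    print("lr =", lr, "->", updated)
```

With `lr=0.00001` the weights barely move away from 0.1, while with `lr=0.1` the same gradients produce visibly different weights after a single step.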

With the following modification:

``````
optimizer = torch.optim.SGD(list(l21.parameters())+list(l22.parameters())+list(l23.parameters())+list(l24.parameters()), lr=0.1, momentum=0)
l21.weight.data = torch.Tensor([[0.1],[0.1]])
l22.weight.data = torch.Tensor([[0.1],[0.1]])
l23.weight.data = torch.Tensor([[0.1],[0.1]])
l24.weight.data = torch.Tensor([[0.1],[0.1]])
``````

I see

``````
tensor([1.], requires_grad=True) tensor([10.], requires_grad=True) tensor([1.], requires_grad=True) tensor([10.], requires_grad=True)
input gradients: tensor([0.8999999762]) tensor([0.8999999762]) tensor([1.6200000048]) tensor([16.2000007629])
[-0.5000000000]]) tensor([[-5.],
[-5.]]) tensor([[-0.8999999762],
[-0.8999999762]]) tensor([[-90.],
[-90.]])
loss values:,ls11,ls12,ls21,ls22
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.1500000060],
[0.1500000060]]) tensor([[0.6000000238],
[0.6000000238]]) tensor([[0.1899999976],
[0.1899999976]]) tensor([[9.1000003815],
[9.1000003815]])
***********
a1, a2, a3, a4
input gradients: tensor([0.8500000238]) tensor([0.3999999762]) tensor([1.3122000694]) tensor([1312.2000732422])
[-0.5000000000]]) tensor([[-5.],
[-5.]]) tensor([[-0.8100000024],
[-0.8100000024]]) tensor([[810.],
[810.]])
loss values:,ls11,ls12,ls21,ls22
weight values:,l21.weight.data,l22.weight.data,l23.weight.data,l24.weight.data
weight values: tensor([[0.2000000030],
[0.2000000030]]) tensor([[1.1000000238],
[1.1000000238]]) tensor([[0.2709999979],
[0.2709999979]]) tensor([[-71.9000015259],
[-71.9000015259]])
***********
a1, a2, a3, a4
input gradients: tensor([0.8000000119]) tensor([0.1000000238]) tensor([1.0628819466]) tensor([106288.2031250000])
[-0.5000000000]]) tensor([[5.],
[5.]]) tensor([[-0.7289999723],
[-0.7289999723]]) tensor([[-7290.],
[-7290.]])
loss values:,ls11,ls12,ls21,ls22