Is autograd.grad producing a wrong grad for l1 and l2 (the linear layers in the example)? (Sorry, I edited the post, I pasted the wrong code initially.)

Here is the code that I am running and the corresponding output:

import torch as t

l1 = t.nn.Linear(1, 1, bias=False)
l1.weight.data[:] = 2
l2 = t.nn.Linear(1, 1, bias=False)
l2.weight.data[:] = 3

inputs = t.tensor([5.], requires_grad=True)
a = l2(l1(inputs))
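# create_graph=True so that g stays differentiable wrt the weights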
g = t.autograd.grad(outputs=a, inputs=inputs, create_graph=True)[0]
print('inputs.grad is ', g)
gp = g * 3
gp.backward()
print('l1_grad is now ', l1.weight.grad)
print('l2_grad is now ', l2.weight.grad)



inputs.grad is  tensor([6.], grad_fn=<SqueezeBackward1>)
l1_grad is now  tensor([[9.]])
l2_grad is now  tensor([[6.]])

Now, to my understanding, the gradients should actually be:
45 for l1_grad
30 for l2_grad

However, it seems that inputs is not taken into account when computing the gradients for l1 and l2?

Hi,

So what you are doing here is the following (I'll write l1 for l1.weight and l2 for l2.weight for readability):

a = l2 * (l1 * inputs)
Gradient of a wrt inputs:
inputs.grad = 1 * l1 * l2 = 6
gp = 3 * inputs.grad = 3 * l1 * l2

Gradient of gp wrt the weights now:
dgp/dl1 = 3 * l2 = 9
dgp/dl2 = 3 * l1 = 6

So I would agree with the autograd result.
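You can double-check those two numbers by asking autograd.grad for them directly instead of calling gp.backward(), with the same setup as in your snippet:

d_l1, d_l2 = t.autograd.grad(outputs=gp, inputs=[l1.weight, l2.weight])
print('dgp/dl1 is ', d_l1)  # tensor([[9.]])  = 3 * l2
print('dgp/dl2 is ', d_l2)  # tensor([[6.]])  = 3 * l1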
Could you explain your logic for getting 45 and 30?


Ah, sorry. Your line

gp = 3 * inputs.grad = 3 * l1 * l2

made me realize my mistake. When taking the gradient of gp wrt l1, I was thinking that inputs was still attached to the graph, i.e. something like
gp = 3 * l1 * l2 * inputs
which is wrong.
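Just to convince myself, if I put inputs back into the expression explicitly (which is what I had in my head), I do get the 45 and 30 I expected. A quick sketch, reusing the layers and inputs from my first post and clearing the old grads first:

l1.weight.grad = None  # clear the grads accumulated by the earlier backward
l2.weight.grad = None
a = l2(l1(inputs))
g = t.autograd.grad(outputs=a, inputs=inputs, create_graph=True)[0]
gp = g * 3 * inputs  # i.e. 3 * l1 * l2 * inputs, what I was imagining
gp.backward()
print('l1_grad is now ', l1.weight.grad)  # tensor([[45.]]) = 3 * l2 * inputs
print('l2_grad is now ', l2.weight.grad)  # tensor([[30.]]) = 3 * l1 * inputs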

Thank you!
