Backward result in PyTorch

I am trying to calculate the Jacobian matrix of a high-dimensional tensor using the backward function in PyTorch.
My test code is below:

    t1 = torch.tensor([1.], requires_grad=True)
    t2 = torch.tensor([
        [0.1, 0.2, 0.3],
        ], requires_grad=True)

    tr = t1 + t2
    tr.backward(torch.tensor([
        [1., 0., 0.],
    ]))

    print(t1.grad.data)
    print("=======================")
    print(t2.grad.data)

And the result is:

    tensor([1.])
    =======================
    tensor([[1., 0., 0.]])

Why is the shape of dtr/dt1 not (3, 1)? Why is its shape (1)?

t1.grad will have the same shape as t1. That's true for any tensor in general: its grad's shape will be the same as its own shape.
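
As a quick check (a minimal sketch reusing the tensors from the first snippet): because t1 is broadcast across t2's last dimension, the gradient flowing back to it is summed over that dimension, which is why it keeps t1's shape rather than tr's shape.

    import torch

    t1 = torch.tensor([1.], requires_grad=True)
    t2 = torch.tensor([[0.1, 0.2, 0.3]], requires_grad=True)
    v = torch.tensor([[1., 0., 0.]])   # the "upstream" gradient passed to backward

    tr = t1 + t2                       # t1 is broadcast over t2's last dimension
    tr.backward(v)

    print(t1.grad.shape == t1.shape)   # True -> torch.Size([1])
    print(t1.grad)                     # tensor([1.]), i.e. v summed over the broadcast dim
    print(t2.grad.shape == t2.shape)   # True -> torch.Size([1, 3])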

And the gradient will be summed into the grad of the leaf node?
For example, now t2's shape is (2, 3), and the grad of t1 becomes [2.]:

    t1 = torch.tensor([1.], requires_grad=True)
    t2 = torch.tensor([
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
        ], requires_grad=True)

    tr = t1 + t2
    tr.backward(torch.tensor([
        [1., 0., 0.],
        [1., 0., 0.]
    ]))

    print(t1.grad.data)
    print("=======================")
    print(t2.grad.data)
And the result is:

    tensor([2.])
    =======================
    tensor([[1., 0., 0.],
            [1., 0., 0.]])

Hi, yes.
Unless you use zero_grad, gradients in PyTorch are accumulated by default.
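
For example (a minimal sketch using the same tensors as your first snippet), calling backward twice without zeroing doubles t1.grad, and t1.grad.zero_() resets it:

    import torch

    t1 = torch.tensor([1.], requires_grad=True)
    t2 = torch.tensor([[0.1, 0.2, 0.3]], requires_grad=True)
    v = torch.tensor([[1., 0., 0.]])

    (t1 + t2).backward(v)
    print(t1.grad)         # tensor([1.])

    (t1 + t2).backward(v)  # second pass: the new gradient is added to the old one
    print(t1.grad)         # tensor([2.])

    t1.grad.zero_()        # reset before the next pass
    (t1 + t2).backward(v)
    print(t1.grad)         # tensor([1.])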


Did you mean an interface like optimizer.zero_grad() or t.grad.zero_()? But what if my backward is a one-pass operation?
I mean, for the code below, it would be better if the gradient of t1 were ([1.]), even though t1 actually contributes to two independent elements of tr. That's because in many cases, different indices of a tensor are independent or computed in parallel.

    t1 = torch.tensor([1.], requires_grad=True)
    t2 = torch.tensor([
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
        ], requires_grad=True)

    tr = t1 + t2
    tr.backward(torch.tensor([
        [1., 0., 0.],
        [1., 0., 0.]
    ]))

    print(t1.grad.data)
    print("=======================")
    print(t2.grad.data)

Both t1.grad.zero_() and optimizer.zero_grad() will zero out the gradient.
If you don't call either of those, then t1.grad will accumulate over backward passes. If you have just one backward pass, then t1.grad will simply be the gradient from that single pass. If you call backward again (without zeroing out the grad), then t1.grad will be the old value plus the new value, and so on.
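
If what you actually want is the full Jacobian (one gradient per output element) rather than a summed vector-Jacobian product, a sketch like the one below may help. It uses torch.autograd.functional.jacobian, assuming a PyTorch version that ships it (roughly 1.5 and later):

    import torch
    from torch.autograd.functional import jacobian

    t1 = torch.tensor([1.])
    t2 = torch.tensor([[0.1, 0.2, 0.3],
                       [0.4, 0.5, 0.6]])

    # jacobian returns one Jacobian per input, shaped output.shape + input.shape
    J_t1, J_t2 = jacobian(lambda a, b: a + b, (t1, t2))

    print(J_t1.shape)  # torch.Size([2, 3, 1]): each output element has d(out)/d(t1) = 1
    print(J_t2.shape)  # torch.Size([2, 3, 2, 3])

Nothing is accumulated into .grad here, so the per-element derivatives are not summed together.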