I would like to understand why autograd.grad gives the sum of gradients instead of a tuple of each gradient of output[i] with respect to the inputs. I am aware that there are ways to do that by using a for loop or is_grads_batched, but the loop is very time-inefficient while the vectorized approach is quite memory-intensive. Therefore, I would like to understand better what happens under the hood in autograd.grad and why it has to give the sum of gradients instead of a tuple including all respective derivatives.
Hi,
What is the exact call to autograd.grad that you’re doing?
If your output is not a scalar, I guess you're passing it a vector of grad_outputs that is all 1s? The backprop algorithm is computing v^T J, where v is the grad_outputs you passed and J is the Jacobian matrix of the network. So depending on the value of v you passed in, then yes, it might sum the gradients.
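To make the v^T J contract concrete, here is a minimal sketch (the function and tensors below are made up for illustration, not taken from your code): for an elementwise y = x ** 2 the Jacobian is diag(2x), and the v you pass as grad_outputs selects which combination of its rows you get back.

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Jacobian J = diag(2x) = diag([2., 4., 6.])

# v of all ones: v^T J sums the rows of J
v = torch.ones(3)
print(torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True))  # (tensor([2., 4., 6.]),)

# v a basis vector: v^T J picks out a single row of J
e0 = torch.tensor([1.0, 0.0, 0.0])
print(torch.autograd.grad(y, x, grad_outputs=e0))  # (tensor([2., 0., 0.]),)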
@albanD let me elaborate a bit on what my goal is and show you an example. Here is some code that illustrates what I mean:
import torch
def scalar_function(vector):
    # Example scalar function: sum of the outer product of the vector with itself,
    # i.e. (x_1 + x_2 + x_3)^2
    return torch.sum(torch.matmul(vector.unsqueeze(1), vector.unsqueeze(0)))
# Let's create a sample vector
input_vector = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Compute the output of the scalar function
output_scalar = scalar_function(input_vector)
print("Output Scalar:\n", output_scalar)
# Compute the gradient of the scalar function with respect to the input vector
gradient = torch.autograd.grad(output_scalar, input_vector, create_graph=True)[0]
# Compute the second derivative of the scalar function with respect to the input vector
n = len(input_vector)
second_derivative = torch.zeros(n, n)
for i in range(n):
    second_derivative[:, i] = torch.autograd.grad(gradient[i], input_vector, retain_graph=True)[0]
print("Gradient:\n", gradient)
print("Second Derivative:\n", second_derivative)
# Compute the gradient again, this time keeping the tuple returned by autograd.grad
gradient = torch.autograd.grad(output_scalar, input_vector, create_graph=True)
print("Gradient:", gradient[0])
# Compute the second derivative via is_grads_batched: one batched backward where
# each row e_i of the identity matrix extracts row i of the Hessian
output = torch.eye(3, 3)
second_derivative = torch.autograd.grad(outputs=gradient, inputs=input_vector,
                                        grad_outputs=output, create_graph=True,
                                        is_grads_batched=True, retain_graph=True)
print("Correct Second Derivative:", second_derivative[0])
# Without is_grads_batched, grad_outputs must match the output's shape, so only a
# length-3 vector is accepted and the rows get summed; passing grad_outputs=output
# here raises: Mismatch in shape: grad_output[0] has a shape of torch.Size([3, 3])
# and output[0] has a shape of torch.Size([3]).
sum_second_derivative = torch.autograd.grad(outputs=gradient, inputs=input_vector,
                                            grad_outputs=torch.tensor([1.0, 1.0, 1.0]),
                                            create_graph=True)
print("Column summed Second Derivative:", sum_second_derivative[0])
The last two prints are:
Correct Second Derivative: tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
Column summed Second Derivative: tensor([6., 6., 6.])
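In other words, the summed result is exactly 1^T H, the column sums of the Hessian: every entry of H is 2 here, and 2 + 2 + 2 = 6. A quick sanity check reusing the tensors from the code above:

H = second_derivative[0]  # the full 3x3 Hessian from the is_grads_batched call
print(torch.ones(3) @ H.detach())  # tensor([6., 6., 6.]), the same values as the summed result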
The loop gives the same output as the is_grads_batched approach. Now I would like to understand why it is necessary to do it that way, and why the plain call without is_grads_batched=True does some kind of summation.
If I understand your comment correctly, I take the Jacobian d out[i]/d in[j] and then do the matrix multiplication with the grad_outputs vector. Is there then a way to just get the Jacobian with autograd.grad?
You have to do multiple backwards to compute the full Jacobian (that's a limitation of AD in general). And that's why you need to pass is_grads_batched: it allows you to do multiple backwards in one go.
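If the end goal is just the full Jacobian or Hessian without wiring up grad_outputs by hand, there are convenience wrappers that do the multiple backwards (or the batching) for you. A sketch reusing scalar_function from the post above; torch.autograd.functional and torch.func are standard PyTorch APIs (torch.func needs PyTorch 2.0+):

import torch
from torch.autograd.functional import hessian

def scalar_function(vector):
    return torch.sum(torch.matmul(vector.unsqueeze(1), vector.unsqueeze(0)))

x = torch.tensor([1.0, 2.0, 3.0])

# One call; the multiple backwards happen under the hood
print(hessian(scalar_function, x))  # 3x3 tensor of all 2s, matching the result above

# Equivalent with the newer torch.func API: compose grad with a reverse-mode Jacobian
from torch.func import grad, jacrev
print(jacrev(grad(scalar_function))(x))  # same 3x3 Hessian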