I would like to understand why autograd.grad gives the sum of gradients instead of a tuple of each gradient of output[i] with respect to the inputs. I am aware that there are ways to do that by using a for loop or is_grads_batched, but the loop is very time-inefficient while the vectorized approach is quite memory-intensive. Therefore, I would like to understand better what happens under the hood in autograd.grad and why it has to give the sum of gradients instead of a tuple including all respective derivatives.
Hi,
What is the exact call to autograd.grad that you’re doing?
If your output is not a scalar, I guess you're passing it a vector of grad_outputs that is all 1s? The backprop algorithm is computing v^T J, where v is the grad_outputs you passed and J is the Jacobian matrix of the network. So depending on the value of v you passed in, then yes, it might sum the gradients.
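To make the v^T J contract concrete, here is a minimal sketch (the function and tensors below are made up for illustration, not taken from your code): for an elementwise y = x ** 2 the Jacobian is diag(2x), and the v you pass as grad_outputs selects which combination of its rows you get back.

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Jacobian J = diag(2x) = diag([2., 4., 6.])

# v of all ones: v^T J sums the rows of J
v = torch.ones(3)
print(torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True))  # (tensor([2., 4., 6.]),)

# v a basis vector: v^T J picks out a single row of J
e0 = torch.tensor([1.0, 0.0, 0.0])
print(torch.autograd.grad(y, x, grad_outputs=e0))  # (tensor([2., 0., 0.]),)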
@albanD let me elaborate a bit on what my goal is and show you an example. Here is some code that illustrates what I mean:
import torch
def scalar_function(vector):
    # Example scalar function: sum of the outer product of the vector with itself,
    # i.e. (x_1 + x_2 + x_3)^2
    return torch.sum(torch.matmul(vector.unsqueeze(1), vector.unsqueeze(0)))
# Let's create a sample vector
input_vector = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Compute the output of the scalar function
output_scalar = scalar_function(input_vector)
print("Output Scalar:\n", output_scalar)
# Compute the gradient of the scalar function with respect to the input vector
gradient = torch.autograd.grad(output_scalar, input_vector, create_graph=True)[0]
# Compute the second derivative of the scalar function with respect to the input vector
n = len(input_vector)
second_derivative = torch.zeros(n, n)
for i in range(n):
    second_derivative[:, i] = torch.autograd.grad(gradient[i], input_vector, retain_graph=True)[0]
print("Gradient:\n", gradient)
print("Second Derivative:\n", second_derivative)
# Compute the gradient again, this time keeping the tuple returned by autograd.grad
gradient = torch.autograd.grad(output_scalar, input_vector, create_graph=True)
print("Gradient:", gradient[0])
# Compute the second derivative via is_grads_batched: one batched backward where
# each row e_i of the identity matrix extracts row i of the Hessian
output = torch.eye(3, 3)
second_derivative = torch.autograd.grad(outputs=gradient, inputs=input_vector,
                                        grad_outputs=output, create_graph=True,
                                        is_grads_batched=True, retain_graph=True)
print("Correct Second Derivative:", second_derivative[0])
# Without is_grads_batched, grad_outputs must match the output's shape, so only a
# length-3 vector is accepted and the rows get summed; passing grad_outputs=output
# here raises: Mismatch in shape: grad_output[0] has a shape of torch.Size([3, 3])
# and output[0] has a shape of torch.Size([3]).
sum_second_derivative = torch.autograd.grad(outputs=gradient, inputs=input_vector,
                                            grad_outputs=torch.tensor([1.0, 1.0, 1.0]),
                                            create_graph=True)
print("Column summed Second Derivative:", sum_second_derivative[0])
The last two prints are:
Correct Second Derivative: tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
Column summed Second Derivative: tensor([6., 6., 6.])
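In other words, the summed result is exactly 1^T H, the column sums of the Hessian: every entry of H is 2 here, and 2 + 2 + 2 = 6. A quick sanity check reusing the tensors from the code above:

H = second_derivative[0]  # the full 3x3 Hessian from the is_grads_batched call
print(torch.ones(3) @ H.detach())  # tensor([6., 6., 6.]), the same values as the summed result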
The loop gives the same output as the is_grads_batched approach. Now I would like to understand why it is necessary to do it that way, and why the plain call without is_grads_batched=True does some kind of summation.
If I understand your comment correctly, I take the Jacobian d out[i]/d in[j] and then do the matrix multiplication with the grad_outputs vector. Is there then a way to just get the Jacobian with autograd.grad?
You have to do multiple backwards to compute the full Jacobian (that's a limitation of AD in general). And that's why you need to pass is_grads_batched: it allows you to do multiple backwards in one go.
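If the end goal is just the full Jacobian or Hessian without wiring up grad_outputs by hand, there are convenience wrappers that do the multiple backwards (or the batching) for you. A sketch reusing scalar_function from the post above; torch.autograd.functional and torch.func are standard PyTorch APIs (torch.func needs PyTorch 2.0+):

import torch
from torch.autograd.functional import hessian

def scalar_function(vector):
    return torch.sum(torch.matmul(vector.unsqueeze(1), vector.unsqueeze(0)))

x = torch.tensor([1.0, 2.0, 3.0])

# One call; the multiple backwards happen under the hood
print(hessian(scalar_function, x))  # 3x3 tensor of all 2s, matching the result above

# Equivalent with the newer torch.func API: compose grad with a reverse-mode Jacobian
from torch.func import grad, jacrev
print(jacrev(grad(scalar_function))(x))  # same 3x3 Hessian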