import torch

def fn(x): return x.pow(2).mean()

input = torch.randn(3, requires_grad=True)  # example input (not defined in the original snippet)
loss = fn(input)
grad, = torch.autograd.grad(loss, input)    # autograd.grad returns a tuple, so unpack it
hessian = torch.func.hessian(fn)(input)
Doesn’t hessian() then need to re-evaluate the function value and the gradients even though I have already computed them? My function might be quite expensive. What would be the best way to compute the hessian?
I believe that you are correct about this. I am not aware of any pre-packaged PyTorch hessian functionality that also gives you access to the loss and grad it would have computed under the hood.
So I do believe that func.hessian() does require duplicative computation in your use
case.
On the other hand, if your grad consists of n components, computing the hessian
(e.g., with func.hessian()) requires n autograd passes (in addition to the first autograd
pass used to compute grad). So the duplicative computation may well be relatively
insignificant.
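To make that concrete, here is a minimal sketch of building the hessian by hand so that the already-computed loss and grad are reused and only the n second-derivative passes remain. It assumes a 1-D input and the toy fn from your snippet, and it requires computing grad with create_graph=True (which your original snippet does not do):

import torch

def fn(x): return x.pow(2).mean()

input = torch.randn(3, requires_grad=True)   # example input
loss = fn(input)                             # this loss value can be reused as-is
grad, = torch.autograd.grad(loss, input, create_graph=True)  # create_graph lets grad be differentiated again

# one reverse-mode pass per component of grad builds one row of the hessian
rows = [torch.autograd.grad(g, input, retain_graph=True)[0] for g in grad]
hessian = torch.stack(rows)

For inputs that are not 1-D you would need to flatten (or use the more general torch.func machinery), so treat this as a sketch rather than a drop-in replacement for func.hessian().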
It is also conceivable that func.hessian() contains some minor internal efficiencies
that are sufficient to overcome the cost of the unnecessary computation.
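For reference, the torch.func documentation describes hessian() as a forward-over-reverse composition, so it is roughly equivalent to the following (using the fn and input from the snippets above):

# forward-mode jacobian of the reverse-mode gradient; this composition is
# often cheaper than taking n separate reverse-mode passes
hessian_fwd_over_rev = torch.func.jacfwd(torch.func.jacrev(fn))(input)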
Note that if your function is expensive, the first-derivative backward pass for grad and the n second-derivative backward passes for the hessian are also likely to be expensive. So, again, the cost of the n second-derivative backward passes may dominate the overall cost, with the cost of the redundant computation of loss and grad being relatively minor.
If the cost of hessian is important to your use case, it would probably make sense to
time both approaches.
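A rough timing harness along those lines (a sketch, assuming the fn and input from the snippets above; manual_hessian is just a hypothetical helper wrapping the loop shown earlier, and real measurements on GPU would also need synchronization):

import time

def manual_hessian(x):
    # recomputes loss and grad here only so the comparison is self-contained
    loss = fn(x)
    grad, = torch.autograd.grad(loss, x, create_graph=True)
    return torch.stack([torch.autograd.grad(g, x, retain_graph=True)[0] for g in grad])

for name, f in [("func.hessian", lambda x: torch.func.hessian(fn)(x)),
                ("manual loop ", manual_hessian)]:
    f(input)                                 # warm-up call
    start = time.perf_counter()
    for _ in range(100):
        f(input)
    print(name, (time.perf_counter() - start) / 100)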