Very Slow to Compute Diagonal Hessian

I am trying to compute only the diagonal of a Hessian matrix. My code looks like this:

# param is a Tensor of size (n, n); disassemble it so that we can take
# grad w.r.t. each individual entry
param = [[param[i][j] for j in range(n)] for i in range(n)]
for p_ in param:
    for p in p_:
        p.requires_grad_()  # make each scalar entry a leaf that requires grad
new_param = torch.stack([torch.stack(p) for p in param])  # reassemble
loss = model(inputs, param=new_param)
first_derivative = torch.autograd.grad(loss, new_param, create_graph=True)[0]
second_derivative = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        second_derivative[i][j] = torch.autograd.grad(first_derivative[i][j], param[i][j], retain_graph=True)[0]

It succeeds, but takes a very long time (almost as long as computing the full Hessian). The problem seems to be the for loops: they back-propagate through the same graph n^2 times.

Is there a way to speed it up? Or what changes should I make to PyTorch to facilitate it?

Many thanks.

It makes sense that it takes as much time as the full Hessian, no? You evaluate each row of the Hessian and only take the one element you want from it.
Note that you don't do this explicitly, but that's what torch.autograd.grad(first_derivative[i][j], inp) is doing; it is equivalent to

mask = torch.zeros_like(first_derivative)
mask[i][j] = 1
torch.autograd.grad(first_derivative, inp, grad_outputs=mask)
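The equivalence can be checked on a small quadratic, where the Hessian is known in closed form. A minimal sketch (the `n`, `A`, and `x` below are made-up example values, not from the thread); each masked grad call performs a full Hessian-vector product, i.e. one whole row, of which only the diagonal entry is kept:

```python
import torch

n = 3
x = torch.randn(n, requires_grad=True)
A = torch.randn(n, n)
A = A + A.t()  # symmetric, so the Hessian of 0.5 * x^T A x is exactly A

loss = 0.5 * x @ A @ x
(first_derivative,) = torch.autograd.grad(loss, x, create_graph=True)

diag = torch.zeros(n)
for i in range(n):
    mask = torch.zeros_like(first_derivative)
    mask[i] = 1.0
    # This back-propagates through the whole graph: it produces the full
    # i-th Hessian row, even though we keep only the diagonal entry.
    (row,) = torch.autograd.grad(first_derivative, x,
                                 grad_outputs=mask, retain_graph=True)
    diag[i] = row[i]

assert torch.allclose(diag, A.diagonal(), atol=1e-5)
```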

Yes, that makes sense.

Is there a way to modify how backprop works in PyTorch so that it computes the second derivative instead of the first? I know it might be a lot of work, but I feel it's worth it, since quite a few methods use the diagonal Hessian for pruning, e.g. Optimal Brain Damage.


Unfortunately, automatic differentiation (which is at the core of PyTorch's backprop) cannot be used to do that. You have to do exactly what you are doing here.
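A hedged aside, not from the thread: if an approximation is acceptable, the diagonal can be *estimated* from a handful of Hessian-vector products via the Hutchinson-style estimator diag(H) ≈ E[v ⊙ (Hv)] with random ±1 (Rademacher) vectors v, rather than paying for n^2 exact entries. The function below is a sketch under that assumption; its name and signature are made up for illustration:

```python
import torch

def hessian_diag_estimate(loss_fn, x, num_samples=100):
    # Stochastic estimate of diag(H) where H is the Hessian of loss_fn at x,
    # using diag(H) ~= mean of v * (H v) over Rademacher vectors v.
    x = x.detach().requires_grad_(True)
    loss = loss_fn(x)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    est = torch.zeros_like(x)
    for _ in range(num_samples):
        v = torch.randint_like(x, high=2) * 2 - 1  # Rademacher +-1 entries
        # One Hessian-vector product per sample, via double backward.
        (hv,) = torch.autograd.grad(g, x, grad_outputs=v, retain_graph=True)
        est += v * hv
    return est / num_samples
```

For a Hessian that is exactly diagonal the estimator has zero variance, which makes a quick sanity check easy: with loss 0.5 * sum(d * x^2) the estimate equals d after any number of samples.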


That’s unfortunate. Thanks anyway!