Compute the Hessian matrix of a network

Hi, I am trying to compute Hessian matrix by calling twice autograd.grad() on a variable.
It works fine in a toy example:

a = torch.FloatTensor([1])
b = torch.FloatTensor([3])

a, b = Variable(a, requires_grad=True), Variable(b, requires_grad=True)

c = a + 3 * b**2
c = c.sum()

grad_b = torch.autograd.grad(c, b, create_graph=True)
grad2_b = torch.autograd.grad(grad_b, b, create_graph=True)



Variable containing:
[torch.FloatTensor of size 1]

But here is the question, I want to compute the Hessian of a network,
so I define a function:

def calculate_hessian(loss, model):
    var = model.parameters()
    temp = []
    grads = torch.autograd.grad(loss, var, create_graph=True)[0]
    grads =[g.view(-1) for g in grads])
    for grad in grads:
        grad2 = torch.autograd.grad(grad, var, create_graph=True)
    return np.array(temp)

It returns an empty list []. Seems like the gradient of grad cannot be computed.
Any help?



Waiting for the same solution.

1 Like

I use follow code to evaluate Hessian matrix:

# eval Hessian matrix
def eval_hessian(loss_grad, model):
    cnt = 0
    for g in loss_grad:
        g_vector = g.contiguous().view(-1) if cnt == 0 else[g_vector, g.contiguous().view(-1)])
        cnt = 1
    l = g_vector.size(0)
    hessian = torch.zeros(l, l)
    for idx in range(l):
        grad2rd = autograd.grad(g_vector[idx], model.parameters(), create_graph=True)
        cnt = 0
        for g in grad2rd:
            g2 = g.contiguous().view(-1) if cnt == 0 else[g2, g.contiguous().view(-1)])
            cnt = 1
        hessian[idx] = g2
    return hessian.cpu().data.numpy()

where loss_grad can calculate like: autograd.grad(loss, net.parameters(), create_graph=True)

Note: it’s only for small network

I’ve solved this problem. Hope below will also help your problem.

1 Like

Hi @JIANG_GUOQING , looking at your code, in the line

g2 = torch.autograd.grad(g, x, retain_graph=True)[0]

g is an individual gradient, but x is a vector of weights.
Is that intended? Don’t you want to calculate individual second order gradients for each individual weight?

However, if I add x = torch.reshape(x, [-1]) line I get

>     g2 = torch.autograd.grad(g, x, retain_graph=True)[0]
> RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Anyone knows what this means?

EDIT: Oh I see that you pick g2[count] afterwards to get the diagonal of the hessian, but I’m still confused why I can’t calculate a gradient of a scalar in respect to a scalar.

Interesting! Anyone can answer the purpose of hessian in cnn. Thanks

I’m using it to penalize growth of second order gradients, so basically the same reason why anyone would penalize first order gradients. In my case it’s to improve noise robustness.


So, it will be integrated in loss function or working as a layer of cnn? Do you have any reference paper using hessian to improve accuracy? I am also interested in classification/detection/segmentation

What do you mean “working as a layer”?

However it’s extremely slow, and I’m about to give up trying to implement it efficiently.

1 Like

No, you don’t need to reshape x. Since the graph was memorized in the previous gradient calculation, you should not change x if you wanna calculate high order gradient.

Hope it makes sense to you.

Michael Klachko 于2019年1月13日周日 上午6:45写道:

Did you use this code during training? I guess not, because switching to numpy would break autodiff…

Yes, you are right. But actually I turned on the training mode because I needed to calculate the high order of ‘Gradient’.

Michael Klachko via PyTorch Forums 于2019年3月28日周四 上午3:29写道:

I think currently it’s only possible to calculate Hessian w.r.t input x not the parameters of the network.

1 Like

I have checked with solutions of @JIANG_GUOQING and @paul_c, overall, they are slow. The biggest problem stuck with the autograd.grad() can only work on single output. Does anyone has faster solution?


Unfortunately, these operations are expected to be relatively slow. You can find here a simple way to get Jacobian and Hessian matrices.

Thank you for your kind reply. Unfortunately, the linked code is even slightly slower.

@tengerye Would the Gauss-Newton matrix work for you? It’s equal to the Hessian for Linear and ReLU networks. For other networks it can be viewed as a positive semidefinite approximation to the Hessian (ie,

Generic Hessian is expensive even for a theoretically optimal algorithm while Gauss-Newton is cheap.

1 Like

@Yaroslav_Bulatov Sorry for the reply. I believe that is exactly I wanted. If possible, is there any tutorial with pytorch code or code demos that you may know? Thank you so much.

@tengerye even though Gauss-Newton is cheap to compute, the matrix will typically need too much storage to store explicitly, so you’ll additionally need some kind of structured approximation. Here’s an example of computing diagonal and KFAC approximations of Gauss-Newton for linear layers –

Im looking to efficiently compute the hessian of my loss function with respect to my inputs (only inputs, not weights). Is this a suitable solution? Having some trouble understanding it.