Compute the Hessian matrix of a network

ToastCheng · March 21, 2018, 3:55pm

Hi, I am trying to compute Hessian matrix by calling twice autograd.grad() on a variable.
It works fine in a toy example:

a = torch.FloatTensor([1])
b = torch.FloatTensor([3])

a, b = Variable(a, requires_grad=True), Variable(b, requires_grad=True)

c = a + 3 * b**2
c = c.sum()

grad_b = torch.autograd.grad(c, b, create_graph=True)
grad2_b = torch.autograd.grad(grad_b, b, create_graph=True)

print(grad2_b)

Output:

Variable containing:
 6
[torch.FloatTensor of size 1]

But here is the question, I want to compute the Hessian of a network,
so I define a function:

def calculate_hessian(loss, model):
    var = model.parameters()
    temp = []
    grads = torch.autograd.grad(loss, var, create_graph=True)[0]
    grads = torch.cat([g.view(-1) for g in grads])
    
    for grad in grads:
        grad2 = torch.autograd.grad(grad, var, create_graph=True)
        temp.append(grad2)
    return np.array(temp)

It returns an empty list []. Seems like the gradient of grad cannot be computed.
Any help?

Thanks,

Wei_Deng · March 22, 2018, 1:59pm

Waiting for the same solution.

paul_c · July 4, 2018, 2:21pm

I use follow code to evaluate Hessian matrix:

# eval Hessian matrix
def eval_hessian(loss_grad, model):
    cnt = 0
    for g in loss_grad:
        g_vector = g.contiguous().view(-1) if cnt == 0 else torch.cat([g_vector, g.contiguous().view(-1)])
        cnt = 1
    l = g_vector.size(0)
    hessian = torch.zeros(l, l)
    for idx in range(l):
        grad2rd = autograd.grad(g_vector[idx], model.parameters(), create_graph=True)
        cnt = 0
        for g in grad2rd:
            g2 = g.contiguous().view(-1) if cnt == 0 else torch.cat([g2, g.contiguous().view(-1)])
            cnt = 1
        hessian[idx] = g2
    return hessian.cpu().data.numpy()

where loss_grad can calculate like: autograd.grad(loss, net.parameters(), create_graph=True)

Note: it’s only for small network

JIANG_GUOQING · December 9, 2018, 8:38am

I’ve solved this problem. Hope below will also help your problem.

github.com

Ageliss/For_shared_codes/blob/master/Second_order_gradients.py

# Author: Guo-qing Jiang (jianggq@mit.edu)
# Pytorch second oder gradient calculation for the diagonal of Hessian matrix 
# feel free to copy
import torch


def get_second_order_grad(grads, xs):
    start = time.time()
    grads2 = []
    for j, (grad, x) in enumerate(zip(grads, xs)):
        print('2nd order on ', j, 'th layer')
        print(x.size())
        grad = torch.reshape(grad, [-1])
        grads2_tmp = []
        for count, g in enumerate(grad):
            g2 = torch.autograd.grad(g, x, retain_graph=True)[0]
            g2 = torch.reshape(g2, [-1])
            grads2_tmp.append(g2[count].data.cpu().numpy())
        grads2.append(torch.from_numpy(np.reshape(grads2_tmp, x.size())).to(DEVICE_IDS[0]))
        print('Time used is ', time.time() - start)

This file has been truncated. show original

michaelklachko · January 12, 2019, 10:35pm

Hi @JIANG_GUOQING , looking at your code, in the line

g2 = torch.autograd.grad(g, x, retain_graph=True)[0]

g is an individual gradient, but x is a vector of weights.
Is that intended? Don’t you want to calculate individual second order gradients for each individual weight?

However, if I add x = torch.reshape(x, [-1]) line I get

>     g2 = torch.autograd.grad(g, x, retain_graph=True)[0]
> RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Anyone knows what this means?

EDIT: Oh I see that you pick g2[count] afterwards to get the diagonal of the hessian, but I’m still confused why I can’t calculate a gradient of a scalar in respect to a scalar.

John1231983 · January 13, 2019, 4:45am

Interesting! Anyone can answer the purpose of hessian in cnn. Thanks

michaelklachko · January 13, 2019, 5:07am

I’m using it to penalize growth of second order gradients, so basically the same reason why anyone would penalize first order gradients. In my case it’s to improve noise robustness.

John1231983 · January 13, 2019, 5:14am

So, it will be integrated in loss function or working as a layer of cnn? Do you have any reference paper using hessian to improve accuracy? I am also interested in classification/detection/segmentation

michaelklachko · January 13, 2019, 5:31am

What do you mean “working as a layer”?

However it’s extremely slow, and I’m about to give up trying to implement it efficiently.

JIANG_GUOQING · March 27, 2019, 5:53am

No, you don’t need to reshape x. Since the graph was memorized in the previous gradient calculation, you should not change x if you wanna calculate high order gradient.

Hope it makes sense to you.

Michael Klachko noreply@discuss.pytorch.org 于2019年1月13日周日上午6:45写道：

michaelklachko · March 27, 2019, 7:19pm

Did you use this code during training? I guess not, because switching to numpy would break autodiff…

JIANG_GUOQING · March 29, 2019, 7:26am

Yes, you are right. But actually I turned on the training mode because I needed to calculate the high order of ‘Gradient’.

Michael Klachko via PyTorch Forums noreply@discuss.pytorch.org 于2019年3月28日周四上午3:29写道：

nima_rafiee · May 23, 2019, 9:34pm

I think currently it’s only possible to calculate Hessian w.r.t input x not the parameters of the network.

tengerye · October 14, 2019, 7:12am

I have checked with solutions of @JIANG_GUOQING and @paul_c, overall, they are slow. The biggest problem stuck with the autograd.grad() can only work on single output. Does anyone has faster solution?

albanD · October 14, 2019, 2:11pm

Hi,

Unfortunately, these operations are expected to be relatively slow. You can find here a simple way to get Jacobian and Hessian matrices.

tengerye · October 15, 2019, 2:28am

Thank you for your kind reply. Unfortunately, the linked code is even slightly slower.

Yaroslav_Bulatov · October 15, 2019, 3:47am

@tengerye Would the Gauss-Newton matrix work for you? It’s equal to the Hessian for Linear and ReLU networks. For other networks it can be viewed as a positive semidefinite approximation to the Hessian (ie, http://andrew.gibiansky.com/blog/machine-learning/gauss-newton-matrix/)

Generic Hessian is expensive even for a theoretically optimal algorithm while Gauss-Newton is cheap.

tengerye · October 22, 2019, 3:05am

@Yaroslav_Bulatov Sorry for the reply. I believe that is exactly I wanted. If possible, is there any tutorial with pytorch code or code demos that you may know? Thank you so much.

Yaroslav_Bulatov · October 22, 2019, 10:33pm

@tengerye even though Gauss-Newton is cheap to compute, the matrix will typically need too much storage to store explicitly, so you’ll additionally need some kind of structured approximation. Here’s an example of computing diagonal and KFAC approximations of Gauss-Newton for linear layers – https://github.com/cybertronai/autograd-lib#autograd_lib

kkudus · October 25, 2019, 7:15pm

Im looking to efficiently compute the hessian of my loss function with respect to my inputs (only inputs, not weights). Is this a suitable solution? Having some trouble understanding it.