Computing the Hessian matrix -- correctness and performance


I’ve been trying to compute the Hessian of my loss function with respect to the parameters of my network. The code I use below is inspired by this discussion:

import torch
from torch import autograd

def compute_hessian(grads, params):
    H = []
    for grad, p in zip(grads, params):
        grad = grad.reshape(-1)
        d = len(grad)
        dg = torch.zeros((d, d))

        # Differentiate each entry of the gradient w.r.t. this parameter
        # again; create_graph=True keeps the graph alive across iterations.
        for j, g in enumerate(grad):
            g2 = autograd.grad(g, p, create_graph=True)[0].view(-1)
            dg[j] = g2

        H.append(dg)  # the (d, d) Hessian block for this parameter

    return H

where params is the list of parameters of my network and grads is computed manually rather than with backward(), i.e.

grads = autograd.grad(loss, self.params, create_graph=True)

My first question is: by iterating through the list of parameters, I am essentially computing a sub-Hessian for each parameter, so my overall Hessian is block-diagonal rather than the full Hessian. I realize I could concatenate all the parameters of my model into one giant vector and then compute the full Hessian, but when I update the parameters I would need to break that vector apart and redistribute the pieces back to the parameter list, which seems like a hassle. Am I making a mistake somewhere, or is this how it’s supposed to work?
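For reference, the flatten-and-redistribute workflow is less of a hassle than it may sound. Below is a minimal sketch on a toy two-parameter model (the model, its shapes, and the `unflatten` helper are all my own illustration, not from the original post): flatten the first-order gradients, differentiate each entry with respect to all parameters to get the full Hessian, and split a flat vector back into parameter shapes.

```python
import torch
from torch import autograd

torch.manual_seed(0)

# Toy model with two parameter tensors, standing in for a network's params.
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
params = [W, b]
x = torch.randn(3)
y = torch.randn(2)

loss = ((W @ x + b - y) ** 2).sum()

# First-order gradients with create_graph=True so we can differentiate again.
grads = autograd.grad(loss, params, create_graph=True)

# Flatten all first-order gradients into one vector of length n.
flat_grad = torch.cat([g.reshape(-1) for g in grads])
n = flat_grad.numel()

# Full (n, n) Hessian: differentiate each entry of the flat gradient
# with respect to *all* parameters, then flatten and stack the results.
H = torch.zeros(n, n)
for i in range(n):
    row = autograd.grad(flat_grad[i], params, retain_graph=True)
    H[i] = torch.cat([r.reshape(-1) for r in row])

# Redistribute a flat vector (e.g. an update step) back to parameter shapes.
def unflatten(vec, params):
    out, offset = [], 0
    for p in params:
        out.append(vec[offset:offset + p.numel()].view_as(p))
        offset += p.numel()
    return out
```

Since the loss here is quadratic in the parameters, `H` is constant and symmetric, which is a quick sanity check on the computation.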

My second question is that the for-loop above, which takes the gradient of each gradient entry, seems slow. Is there any way to speed it up or parallelize it?


Yes, this is how it’s supposed to work. However, for any useful network, the full Hessian is going to be too large to fit in memory. A block-diagonal approximation is common because the off-diagonal blocks may approach rank-1 matrices for deep networks, and hence are not too useful.

You can save some time by computing the block Hessians for all layers from the same set of backward values; here’s an example –
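The example itself appears to have been cut off in the original post. A minimal sketch of the idea: run one backward pass with `create_graph=True`, then reuse those first-order gradients for every block, replacing the per-entry inner loop with a single batched vector-Jacobian product via `is_grads_batched=True` (available in `torch.autograd.grad` since PyTorch 1.11). The toy two-layer model below is my own illustration, not from the original answer.

```python
import torch
from torch import autograd

torch.manual_seed(0)

# Toy two-layer setup so there are several parameter blocks.
W1 = torch.randn(4, 3, requires_grad=True)
W2 = torch.randn(2, 4, requires_grad=True)
params = [W1, W2]
x = torch.randn(3)
y = torch.randn(2)

loss = ((W2 @ torch.tanh(W1 @ x) - y) ** 2).sum()

# One backward pass with create_graph=True; all blocks reuse these grads.
grads = autograd.grad(loss, params, create_graph=True)

H_blocks = []
for g, p in zip(grads, params):
    g = g.reshape(-1)
    d = g.numel()
    # Batched vector-Jacobian products against the identity: one call per
    # block instead of one call per gradient entry (PyTorch >= 1.11).
    block = autograd.grad(g, p, grad_outputs=torch.eye(d),
                          is_grads_batched=True, retain_graph=True)[0]
    H_blocks.append(block.reshape(d, d))
```

Each diagonal block of the Hessian is symmetric, which gives a quick correctness check; the batching is done internally with vmap, so it avoids Python-loop overhead rather than recomputing the backward graph per entry.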