Hello,
I’ve been trying to compute the Hessian of my loss function with respect to the parameters of my network. The code I use below is inspired by this discussion:
import torch
from torch import autograd

def compute_hessian(grads, params):
    H = []
    for grad, p in zip(grads, params):
        grad = grad.reshape(-1)        # flatten this parameter's gradient
        d = len(grad)
        dg = torch.zeros((d, d))
        # differentiate each entry of the gradient w.r.t. the same parameter
        for j, g in enumerate(grad):
            g2 = autograd.grad(g, p, create_graph=True)[0].view(-1)
            dg[j] = g2                 # row j of this parameter's Hessian block
        H.append(dg)
    return H
where params is the list of parameters of my network and grads is computed manually rather than using backward, i.e.

grads = autograd.grad(loss, self.params, create_graph=True)
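For reference, this is roughly how I call it, using a toy model just for illustration (not my actual network):

import torch
from torch import autograd, nn

model = nn.Linear(3, 1)               # stand-in for my real network
params = list(model.parameters())

x = torch.randn(8, 3)
y = torch.randn(8, 1)
loss = ((model(x) - y) ** 2).mean()

# gradients computed manually, with create_graph=True so they stay differentiable
grads = autograd.grad(loss, params, create_graph=True)
H = compute_hessian(grads, params)    # one (d_i x d_i) block per parameter tensor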
My first question is: by iterating through the list of parameters, I am essentially computing a sub-Hessian for each parameter, so my overall Hessian is block diagonal rather than the full Hessian. I do realize that I could concatenate all the parameters of my model into one giant vector and then compute the full Hessian, but when I update the parameters I would have to break that vector apart and redistribute the pieces back to the list of parameters, which seems like a hassle. Am I making a mistake somewhere, or is this how it’s supposed to work?
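To make the first question concrete, this is the kind of D x D object I have in mind (D being the total number of parameters). It is only an untested sketch, and compute_full_hessian is a name I made up for this post; it assumes grads was produced with create_graph=True as above:

def compute_full_hessian(grads, params):
    # flatten the per-parameter gradients into one long vector of length D
    g_flat = torch.cat([g.reshape(-1) for g in grads])
    D = g_flat.numel()
    H = torch.zeros((D, D))
    for k in range(D):
        # row k: derivative of the k-th gradient entry w.r.t. every parameter
        row = autograd.grad(g_flat[k], params, retain_graph=True)
        H[k] = torch.cat([r.reshape(-1) for r in row])
    return H

Even with something like this, I would still have to split any flat update vector back into the per-parameter tensors (I gather torch.nn.utils.vector_to_parameters can do that), which is the part that feels clunky.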
My second question is: the for-loop above, which takes the grad of each gradient entry, seems to be slow. Is there any way to speed it up or add more parallelism to it?
Thanks,