Hello,
I’ve been trying to compute the Hessian of my loss function with respect to the parameters of my network. The code I use below is inspired by this discussion:
import torch
from torch import autograd

def compute_hessian(grads, params):
    H = []
    for grad, p in zip(grads, params):
        grad = grad.reshape(-1)        # flatten this parameter's gradient
        d = len(grad)
        dg = torch.zeros((d, d))
        # differentiate each entry of the gradient w.r.t. the same parameter
        for j, g in enumerate(grad):
            g2 = autograd.grad(g, p, create_graph=True)[0].view(-1)
            dg[j] = g2                 # row j of this parameter's Hessian block
        H.append(dg)
    return H
where params is the list of parameters of my network and grads is computed manually rather than using backward, i.e.

grads = autograd.grad(loss, self.params, create_graph=True)
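For reference, this is roughly how I call it, using a toy model just for illustration (not my actual network):

import torch
from torch import autograd, nn

model = nn.Linear(3, 1)               # stand-in for my real network
params = list(model.parameters())

x = torch.randn(8, 3)
y = torch.randn(8, 1)
loss = ((model(x) - y) ** 2).mean()

# gradients computed manually, with create_graph=True so they stay differentiable
grads = autograd.grad(loss, params, create_graph=True)
H = compute_hessian(grads, params)    # one (d_i x d_i) block per parameter tensor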
My first question is: by iterating through the list of parameters, I am essentially computing a sub-Hessian for each parameter, so my overall Hessian is block diagonal rather than the full Hessian. I do realize that I could concatenate all the parameters of my model into one giant vector and then compute the full Hessian, but when I update the parameters I would have to break that vector apart and redistribute the pieces back to the list of parameters, which seems like a hassle. Am I making a mistake somewhere, or is this how it’s supposed to work?
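To make the first question concrete, this is the kind of D x D object I have in mind (D being the total number of parameters). It is only an untested sketch, and compute_full_hessian is a name I made up for this post; it assumes grads was produced with create_graph=True as above:

def compute_full_hessian(grads, params):
    # flatten the per-parameter gradients into one long vector of length D
    g_flat = torch.cat([g.reshape(-1) for g in grads])
    D = g_flat.numel()
    H = torch.zeros((D, D))
    for k in range(D):
        # row k: derivative of the k-th gradient entry w.r.t. every parameter
        row = autograd.grad(g_flat[k], params, retain_graph=True)
        H[k] = torch.cat([r.reshape(-1) for r in row])
    return H

Even with something like this, I would still have to split any flat update vector back into the per-parameter tensors (I gather torch.nn.utils.vector_to_parameters can do that), which is the part that feels clunky.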
My second question is: the for-loop above, which takes the grad of each gradient entry, seems to be slow. Is there any way to speed it up or add more parallelism to it?
Thanks,