I’ve been trying to compute the Hessian of my loss function with respect to the parameters of my network. The code I use below is inspired by this discussion:
```python
import torch
from torch import autograd

def compute_hessian(grads, params):
    # One Hessian block per parameter (so the result is block-diagonal).
    H = []
    for grad, p in zip(grads, params):
        grad = grad.reshape(-1)
        d = len(grad)
        dg = torch.zeros((d, d))
        for j, g in enumerate(grad):
            # autograd.grad returns a tuple; take the first (and only) element
            g2 = autograd.grad(g, p, create_graph=True)[0].view(-1)
            dg[j] = g2
        H.append(dg)
    return H
```
Here `params` is the list of parameters of my network, and `grads` is computed manually rather than via

```python
grads = autograd.grad(loss, self.params, create_graph=True)
```
My first question is: by iterating through the list of parameters, I am essentially computing a sub-Hessian for each parameter, so my overall Hessian is block-diagonal rather than the full Hessian. I realize I could concatenate all the parameters of my model into one giant vector and compute the full Hessian, but then whenever I update the parameters I would need to break that vector apart and redistribute the pieces back to the parameter list, which seems like a hassle. Am I making a mistake somewhere, or is this how it’s supposed to work?
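For what it’s worth, the flatten/redistribute round trip may be less of a hassle than it sounds: PyTorch ships `torch.nn.utils.parameters_to_vector` and `vector_to_parameters` for exactly this. A minimal sketch (the `nn.Linear` model here is just a stand-in for illustration):

```python
import torch
from torch import nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# A tiny stand-in model; any nn.Module's parameter list works the same way.
model = nn.Linear(3, 2)
params = list(model.parameters())

# Flatten all parameters into one giant vector (e.g. to work with the full Hessian)...
flat = parameters_to_vector(params)  # shape: (3*2 + 2,) = (8,)

# ...and after updating that vector, write it back into the parameter list in place.
updated = flat - 0.1 * torch.ones_like(flat)  # some hypothetical update
vector_to_parameters(updated, params)
```

The round trip preserves each parameter’s shape, so the rest of the training loop doesn’t need to change.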
My second question is: the for-loop above, used to take the grad of each gradient component, seems slow. Is there any way to speed it up or add more parallelization to it?
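One built-in alternative to the element-by-element loop is `torch.autograd.functional.hessian`, which takes a function of the inputs and returns the full Hessian directly (it also accepts a `vectorize=True` flag that can batch the backward passes). A small sanity-check sketch on a toy function, not on the network above:

```python
import torch

# Toy "loss": for f(x) = sum(x_i ** 3), the Hessian is diag(6 * x).
def f(x):
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0, 3.0])
H = torch.autograd.functional.hessian(f, x)
# Expected: diag([6., 12., 18.])
```

To use this for a network loss, you would wrap the loss as a function of the (flattened) parameters; for newer PyTorch versions, `torch.func.hessian` offers a similar, `vmap`-based route.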