I’m looking at an implementation for calculating the Hessian matrix of the loss function.
```python
loss = self.loss_function()
loss.backward(retain_graph=True)
# p is the weight matrix for a particular layer
grad_params = torch.autograd.grad(loss, p, create_graph=True)
hess_params = torch.zeros_like(grad_params[0])
for i in range(grad_params[0].size(0)):
    for j in range(grad_params[0].size(1)):
        hess_params[i, j] = torch.autograd.grad(grad_params[0][i][j], p, retain_graph=True)[0][i, j]
```
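For reference, here is a self-contained, runnable version of that snippet (the toy layer, input, and loss are invented for illustration). Note what the double loop actually extracts: only the diagonal second derivatives ∂²L/∂p_ij², not the full Hessian.

```python
import torch

torch.manual_seed(0)
p = torch.randn(2, 3, requires_grad=True)  # toy "weight matrix" for one layer
x = torch.randn(3)                         # toy input
loss = ((p @ x) ** 2).sum()                # toy scalar loss, made up for this sketch

# create_graph=True makes the first derivatives themselves differentiable
grad_params = torch.autograd.grad(loss, p, create_graph=True)

hess_diag = torch.zeros_like(grad_params[0])
for i in range(grad_params[0].size(0)):
    for j in range(grad_params[0].size(1)):
        # differentiate the (i, j) entry of the gradient, then keep only
        # its own (i, j) component: the diagonal Hessian entry d²L/dp_ij²
        hess_diag[i, j] = torch.autograd.grad(
            grad_params[0][i][j], p, retain_graph=True
        )[0][i, j]
```

For this loss, ∂L/∂p_ij = 2 (p @ x)_i x_j, so the diagonal entries are 2 x_j², independent of the row index i.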
I had 3 questions:
Why do we compute the Hessian in a loop? Can't we use something along the lines of
There’s a fundamental problem: the Hessian is large. Take ResNet-50, which has ~25M parameters; its Hessian then has 625 trillion entries. This means that for large networks you have to work with factorized approximations, or restrict yourself to a subset of the entries, such as the diagonal, which can be obtained at a cost similar to that of the gradient.
I.e., for ReLU networks you can get the diagonal exactly using the Gauss-Newton trick implemented here, and for more general networks you can use the Hutchinson estimator through Hessian-vector products, as in this colab.
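A minimal sketch of the Hutchinson approach (not the linked colab; the quadratic toy loss and sample count are made up): it estimates diag(H) ≈ E[v ⊙ (H v)] with Rademacher vectors v, costing one extra backward pass per sample instead of one per Hessian entry.

```python
import torch

torch.manual_seed(0)

# toy quadratic loss L(p) = 0.5 * p^T A p, so the true Hessian is A
A = torch.tensor([[2.0, 0.5, 0.0],
                  [0.5, 3.0, 0.5],
                  [0.0, 0.5, 4.0]])
p = torch.randn(3, requires_grad=True)
loss = 0.5 * p @ A @ p

# differentiable gradient, so we can take second derivatives
grad, = torch.autograd.grad(loss, p, create_graph=True)

n_samples = 1000
diag_est = torch.zeros_like(p)
for _ in range(n_samples):
    v = (torch.randint(0, 2, p.shape) * 2 - 1).to(p.dtype)  # Rademacher ±1
    # Hessian-vector product H v via one backward pass over the gradient
    hv, = torch.autograd.grad(grad, p, grad_outputs=v, retain_graph=True)
    diag_est += v * hv
diag_est /= n_samples
# diag_est approaches diag(A) = (2, 3, 4) up to Monte-Carlo noise
```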
@Yaroslav_Bulatov and others helped me formulate more efficient Jacobian calculators:
Essentially you can eliminate the for loops (and significantly speed up the computation) at the cost of memory. Probably too much memory for your large Hessian, but I thought I’d add this in case it’s useful.
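As a hedged illustration of the loop-free trade-off (the toy loss is invented): for small parameter counts, `torch.autograd.functional.hessian` materializes the full Hessian of a scalar function in one call, which is exactly the memory cost that rules it out for large networks.

```python
import torch

def loss_fn(p):
    # toy scalar loss, made up for illustration
    return (p ** 3).sum()

p = torch.tensor([1.0, 2.0, 3.0])

# full (num params) x (num params) Hessian, no explicit loops
H = torch.autograd.functional.hessian(loss_fn, p)
# Hessian of sum(p**3) is diag(6 * p), i.e. diag(6, 12, 18)
```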