Calculating the Hessian with 0.2?

oneweirdkerneltrick · September 10, 2017, 3:28am

Hi! I am trying to get the Hessian but since torch.autograd.grad only takes in scalar values, my implementation is very inefficient.

Basically what I’m doing is:

input = Variable(torch.randn(1,5), requires_grad = True)
target = Variable(torch.zeros(1))
criterion = nn.NLLLoss()
pred = model(input) #let's say model is just an MLP that outputs a logsoftmax
nll = criterion(pred, target)
input_grad = torch.autograd.grad(nll, input, create_graph=True)[0]
#autograd only works with a scalar so i gotta call grad on input_grad N times, where N is the dimensionality of the input.
hessian = [torch.autograd.grad(input_grad[0][i], input, create_graph=True[0]) for i in range(input.size(1))]

#now this is what I wanna do, but it doesn't work because input_grad is not a scalar.
hessian = torch.autograd.grad(input_grad, input, create_graph=True)[0]

Is there a better way to obtain the Hessian in a more efficient manner?

On a related note, is there an example of calculating the diagonal approximation to the Hessian, along the lines of (http://yann.lecun.com/exdb/publis/pdf/becker-lecun-89.pdf)?

Thanks!

apaszke · September 11, 2017, 2:28am

It’s not so much that grad takes in scalar values (it can work with tensors of arbitrary size) - the limiting factor is that the autograd module can only compute jacobian-vector products. In case of taking grad of the gradients it’s actually computing a hessian-vector product of the original function. So, to recover the full hessian you have to do these N grad calls, where each call will multiply it by a one-hot vector (let’s say with a 1 in ith place), and will recover the ith row of the actual hessian.

I’m not familiar with the method for diagonal approximation, so I’d need to read up to answer your second question.

Hope this helps! Let me know if anything was unclear or you need any more help!

Dynasty · December 5, 2017, 2:06pm

To anyone in the future looking for a solution to getting the diagonal of the Hessian: consider using the second directional derivative. Suppose f(x) is the objective function, then the second directional derivative can be calculated like this:

(d/dh(d/dh f(x+a*h))) | h=0

Here the “a” vector indicates the direction of interest and “h” is a scalar value.
To retrieve an element of the diagonal just plug in the corresponding one-hot vector for “a”.

I was able to easily find the second directional derivative using http://www.matrixcalculus.org/
Just plug in f(x+a*h) and differentiate with respect to h. This gives the first directional derivative. Take the answer and plug it in again to obtain the second directional derivative. If for some reason you cannot find the second directional derivative, but can find the first, then PyTorch should be able to obtain it for you anyway via autograd.

evcu · February 7, 2018, 5:38pm

Hi @Dynasty. Can you elaborate on how directional derivative is related to getting the diagonal of the hessian. I am looking for a method to calculate it as we calculate hessian-vector product.

Dynasty · February 8, 2018, 3:37pm

@evcu This method using the directional derivatives is really only useful if you want to find the diagonal of the Hessian. This method also gives one element of the diagonal for each iteration. The difference is that the method @apaszke talked about requires a conjugate gradient descent iteration for each entry in the diagonal, but the directional derivative method does not.

Typically the directional derivative is calculated by projecting the gradient onto a normalized vector in the direction of interest. In this setup, the gradient is computed explicitly, and the directional derivative is derived from this result. The trick here is to do the process in reverse: calculate the directional derivative explicitly and then use this result to find a piece of the gradient. Since the directional derivative isn’t really used for many applications, its usefulness for computing part of the Hessian isn’t obvious. For this reason I’m planning on writing a paper to submit to a machine learning journal. It will probably be a short two page paper since the method is very simple. I believe that this method works because the gradient projection is equal to the directional derivative. I will probably formalize this result in the paper (it should not be too difficult).

Here is a simple example to demonstrate the method. Suppose you were trying to find the Hessian of this function (x is a vector, sin is applied element-wise, and sum just adds up all the elements in the vector):

f(x)=sum(sin(x))

Now we find the first and second directional derivative using matrix calculus (a is the direction vector, h is a scalar, and ⊙ is element-wise multiplication):

g(x) = f(x+a*h) = sum(sin(x+a*h))
∂g(x)/∂h = g'(x) = cos(x+h*a)ᵀ*a = cos(x+h*a)⋅a
∂g'(x)/∂h = g''(x) = -(a⊙sin(x+h*a))ᵀ*a = -(a⊙sin(x+h*a))⋅a

Let’s take the case where x=〈a,b,c〉is a 3 element vector. To find the expression for the first entry in the diagonal of the Hessian, substitute h=0 and a=〈1,0,0〉 (aka the vector pointing along one of the axes) into g’’(〈a,b,c〉). To find the second entry in the diagonal, substitute h=0 and a=〈0,1,0〉. For the third entry in the diagonal, substitute h=0 and a=〈0,0,1〉. This substitution process results in the diagonal, which I’ve represented here as a vector:

diag(H(f)(〈a,b,c〉))=〈-sin(a),-sin(b),-sin(c)〉

You can verify that this is the correct answer by finding the Hessian matrix in Mathematica.

I am unsure as to whether or not PyTorch can obtain the Hessian diagonal using autograd. It might take a modification to the source code to add this feature. I assume it could be added in a similar way that the current Hessian-vector product was added. Maybe @apaszke can comment on the implementation details.

evcu · February 8, 2018, 4:03pm

Interesting! Thanks for the detailed explanation. The last line(second derivative) reminds me the Hessian-vector product that we use to calculate a row of hessian. If considered from that perspective it looks like calculating a row of the hessian and then getting the right element, so that overall it is equivalent to calculating the whole hessian and getting the diagonal.

I am not sure there is a space for optimization since, it looks like we need to set the $$a$$ to calculate the second derivative and this would require N hessian-vector product, compare to 1 hessian-vector product for getting a row or column.

Dynasty · February 8, 2018, 4:36pm

@evcu If you want the full Hessian, the algorithm @apaszke described would definitely be the way to go. I’m not sure how autograd is implemented, but I think it might still work out okay. The reason I’m saying this is because the derivative is with respect to h, which is a scalar value. The typical autograd derivative calculations are with respect to vectors (or sometimes matrices/tensors). I will have to think about this some more. Hopefully someone with insight into how autograd works will comment on this issue.

apaszke · February 21, 2018, 9:52pm

@Dynasty thanks for posting those formulas, that’s quite interesting. The downside is that they always end with a dot product with this a vector, where you substitute it with a one-hot value to get a single entry. Note that removing the dot product would just give you a row of the hessian. In torch.autograd (or any other reverse-mode AD software that doesn’t do partial evaluation - I’m not aware of any) this would be equivalent to just computing the full row, and then only taking the diagonal entry, so there are no savings