@evcu This method using the directional derivatives is really only useful if you want to find the diagonal of the Hessian. This method also gives one element of the diagonal for each iteration. The difference is that the method @apaszke talked about requires a conjugate gradient descent iteration for each entry in the diagonal, but the directional derivative method does not.
Typically the directional derivative is calculated by projecting the gradient onto a normalized vector in the direction of interest. In this setup, the gradient is computed explicitly, and the directional derivative is derived from this result. The trick here is to do the process in reverse: calculate the directional derivative explicitly and then use this result to find a piece of the gradient. Since the directional derivative isn’t really used for many applications, its usefulness for computing part of the Hessian isn’t obvious. For this reason I’m planning on writing a paper to submit to a machine learning journal. It will probably be a short two page paper since the method is very simple. I believe that this method works because the gradient projection is equal to the directional derivative. I will probably formalize this result in the paper (it should not be too difficult).
Here is a simple example to demonstrate the method. Suppose you were trying to find the Hessian of this function (x is a vector, sin is applied element-wise, and sum just adds up all the elements in the vector):
f(x)=sum(sin(x))
Now we find the first and second directional derivative using matrix calculus (a is the direction vector, h is a scalar, and ⊙ is element-wise multiplication):
g(x) = f(x+a*h) = sum(sin(x+a*h))
∂g(x)/∂h = g'(x) = cos(x+h*a)ᵀ*a = cos(x+h*a)⋅a
∂g'(x)/∂h = g''(x) = -(a⊙sin(x+h*a))ᵀ*a = -(a⊙sin(x+h*a))⋅a
Let’s take the case where x=〈a,b,c〉is a 3 element vector. To find the expression for the first entry in the diagonal of the Hessian, substitute h=0 and a=〈1,0,0〉 (aka the vector pointing along one of the axes) into g’’(〈a,b,c〉). To find the second entry in the diagonal, substitute h=0 and a=〈0,1,0〉. For the third entry in the diagonal, substitute h=0 and a=〈0,0,1〉. This substitution process results in the diagonal, which I’ve represented here as a vector:
diag(H(f)(〈a,b,c〉))=〈-sin(a),-sin(b),-sin(c)〉
You can verify that this is the correct answer by finding the Hessian matrix in Mathematica.
I am unsure as to whether or not PyTorch can obtain the Hessian diagonal using autograd. It might take a modification to the source code to add this feature. I assume it could be added in a similar way that the current Hessian-vector product was added. Maybe @apaszke can comment on the implementation details.