LBFGS not functioning the way it is supposed to

Dear all,
LBFGS is not functioning the way it is supposed to.
When I substituted the quadratic function (X-6)^2 + (Y-6)^2 for the Rosenbrock function in the test cases, it did not converge to [6, 6] with an optimal function value close to 0.
Moreover, the Hessian in L-BFGS should be a square matrix with the dimension of the input vector, whereas the current implementation treats it as a scalar.
Please go through the BFGS implementation in the following request.
Could you please let me know if I can have a discussion with you about this issue, so that I can explain exactly what the problem is and make a pull request for this algorithm?
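For reference, a minimal reproduction sketch for the quadratic above, using `torch.optim.LBFGS`. The starting point (zeros) and `lr=1.0` are assumptions; other initial conditions or learning rates may behave differently.

```python
import torch

# Optimize f(X, Y) = (X - 6)^2 + (Y - 6)^2; the minimizer is [6, 6].
params = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.LBFGS([params], lr=1.0, max_iter=100)

def closure():
    optimizer.zero_grad()
    loss = ((params - 6.0) ** 2).sum()  # (X-6)^2 + (Y-6)^2
    loss.backward()
    return loss

# A single .step() call runs up to max_iter inner L-BFGS iterations.
optimizer.step(closure)

with torch.no_grad():
    print(params)                        # expected: close to [6., 6.]
    print(((params - 6.0) ** 2).sum())   # expected: close to 0
```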


True … the same thing is happening for me with another quadratic function similar to this one.

As per the documentation, LBFGS takes only a single step per call, following the step definition in the link below.
But inside the file it takes more steps, which is not what the documentation reports.

The original Torch7 L-BFGS was:

heavily inspired by minFunc (Mark Schmidt)

I have been having issues getting optim.lbfgs to perform (in terms of memory usage) like the Torch7 version. In the Torch7 version, memory usage stopped increasing once you started running your feval/closure function. But the PyTorch version just keeps using more memory as the iterations go by, until it hits some ceiling or your script crashes with an “out of memory” error.

Yeah, finally someone is stepping in to make LBFGS more stable. I have had a hard time getting my models to converge with LBFGS, even when the problem was optimizing a simple 1D function.

With regard to the original comments, it would probably be useful if you could provide some code that reproduces the trouble you’re having. I tried your example problem and it converged fine, but perhaps it will not for some initial conditions or learning rates.

There is no Hessian in L-BFGS. The idea of BFGS is to approximate the Hessian, and the idea of Limited-memory BFGS is to approximate the Hessian (which requires O(n^2) storage) with O(n) storage. In fact, the approximate Hessian is never materialized; only the approximate Hessian-gradient product is (i.e. the descent direction). In Nocedal and Wright, they advise initializing the approximate Hessian as a scaled diagonal matrix, so you need only compute a scalar, which is what you are referring to.
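To make this concrete, here is a sketch of the standard L-BFGS two-loop recursion (Nocedal and Wright, Algorithm 7.4) in NumPy. Note that no n-by-n matrix ever appears: only the stored (s, y) curvature pairs are used, and the initial Hessian approximation is just the scalar gamma = s^T y / y^T y times the identity.

```python
import numpy as np

def two_loop_direction(grad, s_list, y_list):
    """Return the descent direction -H_k @ grad without forming H_k.

    s_list, y_list hold the stored curvature pairs s_i = x_{i+1} - x_i
    and y_i = grad_{i+1} - grad_i, oldest first.
    """
    q = grad.astype(float).copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * s.dot(q)
        alphas.append(alpha)
        q -= alpha * y
    # Scalar initial Hessian approximation H0 = gamma * I -- this is the
    # "scalar" in the implementation, not a full Hessian.
    if s_list:
        gamma = s_list[-1].dot(y_list[-1]) / y_list[-1].dot(y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos),
                                  reversed(alphas)):
        r += s * (alpha - rho * y.dot(r))
    return -r
```

For a quadratic with Hessian 2I (like the example above), a single stored pair such as s = [1, 0], y = [2, 0] already recovers the exact Newton direction -0.5 * grad.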

FWIW, I checked the implementation against the text and against the implementation in Scala’s Breeze optimization library, and they look the same apart from the choice of step size. The choice of a constant step size was by design, though, as noted in the post by ProGamerGov. I can’t comment on the memory issues, and I haven’t been able to reproduce any failures for the toy scenario mentioned.

Making a shameless self plug, but I just released my implementation of L-BFGS (based on various codes, including Mark Schmidt’s minFunc, the original PyTorch implementation, and Michael Overton’s weak Wolfe line search). It addresses some of the issues you all mention here and is also compatible with recent advancements in stochastic quasi-Newton methods, such as multi-batch and Schraudolph’s approaches for performing quasi-Newton updating, as well as Powell damping. Feel free to take a look, and any feedback would be appreciated!

Link is here:



@hjmshi this works great! Do you know if there are any plans to include this as an optimizer in PyTorch?

Thanks! We are looking into doing that at the moment. However, we would probably only expect to include the full-batch L-BFGS method (which is the most stable) in PyTorch, as stochastic quasi-Newton methods (particularly for deep learning) are still an active research area.

@hjmshi thanks for your reply! I have some questions but will open an issue in your repo to discuss them.

Sounds great! We definitely would appreciate any feedback or questions.