I’ve recently released a modular implementation of L-BFGS that is compatible with many recent algorithmic advancements for improving and stabilizing stochastic quasi-Newton methods, and that addresses many of the deficiencies of the existing PyTorch L-BFGS implementation. It is designed to give researchers and practitioners maximal flexibility in designing and implementing stochastic quasi-Newton methods for training neural networks.
Since the implementation is still quite immature, I’m eagerly looking for feedback and wanted to provide an open forum for practitioners to share their experiences with the code. If you’ve had the opportunity to try it out, please let us know how it’s performing; this will help us improve the implementation and may even suggest future research questions on stochastic quasi-Newton methods, particularly for training neural networks.
It’s not technical commentary, but there are two enhancements I’d suggest.
First, for some people, Jupyter notebooks with a convergence graph or similar may make the repository more attractive. At least I, personally, always look for notebooks first. (Opinions may differ, though.)
Second, I’d probably try to make things more PyTorchy, e.g.:
replace the dependency on Keras for the datasets with one on torchvision,
provide a dataloader that facilitates the overlapping sampling (I must admit I’m not sure how easy that is),
consider integrating the various bits of the optimizer into optimizer.step and passing a closure (a sketch of the usual pattern, covering these points, follows below).
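To make those suggestions concrete, here is a minimal sketch of what that could look like. Everything below is hypothetical example code (the OverlapBatchSampler and the tiny model are placeholders I made up), not the package’s actual API; it only shows the standard torchvision / DataLoader / closure machinery that torch.optim.LBFGS already expects.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

# (1) torchvision instead of the Keras dataset helpers.
train_set = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)

# (2) A (hypothetical) batch sampler whose consecutive batches share
#     `overlap` indices, so gradient differences can later be formed on a
#     consistent subset of samples.
class OverlapBatchSampler(torch.utils.data.Sampler):
    def __init__(self, dataset_len, batch_size, overlap):
        self.dataset_len, self.batch_size, self.overlap = dataset_len, batch_size, overlap

    def __iter__(self):
        perm = torch.randperm(self.dataset_len).tolist()
        batch, pos = perm[: self.batch_size], self.batch_size
        fresh = self.batch_size - self.overlap
        while True:
            yield batch
            if pos + fresh > self.dataset_len:
                break
            # carry the tail of the current batch over into the next one
            batch = batch[-self.overlap:] + perm[pos : pos + fresh]
            pos += fresh

    def __len__(self):
        return 1 + (self.dataset_len - self.batch_size) // (self.batch_size - self.overlap)

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_sampler=OverlapBatchSampler(len(train_set), batch_size=512, overlap=128),
)

# (3) The usual closure pattern that torch.optim.LBFGS already uses:
#     step() may re-evaluate the closure during the line search.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=10,
                              line_search_fn="strong_wolfe")

for inputs, targets in train_loader:
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        return loss

    optimizer.step(closure)
```

The batch_sampler hook seems like the most natural place to put the overlapping sampling, since the optimizer then never has to know how the indices were chosen.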
Another thing to try might be reaching out to the GPyTorch people or including an example with Gaussian processes; I’ve always thought of (L-)BFGS as the premier way to train GPs.
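For what it’s worth, here is a rough sketch of what such an example might look like, using GPyTorch’s standard ExactGP setup together with the closure-based LBFGS already in torch.optim. The model class, data, and hyperparameters are just illustrative assumptions, not anything from the package being discussed.

```python
import math
import torch
import gpytorch

# Toy 1-D regression data.
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(2 * math.pi * train_x) + 0.1 * torch.randn(100)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train()
likelihood.train()
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1,
                              history_size=10, line_search_fn="strong_wolfe")

def closure():
    # Negative marginal log likelihood is the loss for GP hyperparameter fitting.
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    return loss

for _ in range(20):
    optimizer.step(closure)
```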
Thanks for the feedback, Thomas! Many great points, and I will definitely look into reaching out to the GPyTorch community.
Regarding integrating the various bits of the optimizer into optimizer.step: we have chosen not to do this in order to preserve flexibility in the sample selection used when defining the updates and gradient differences in L-BFGS. This is (I think) the easiest way to make the code compatible with both multi-batch and full-overlap L-BFGS. We also suspect that, as stochastic quasi-Newton methods improve and other mechanisms are introduced, one may need to use the two-loop recursion outside of simply defining the search direction, which is why we chose to keep it separate for now. (One example is what our group calls “progressive batching” or “adaptive sampling”, where adaptive tests are used to automatically increase the batch size over the course of the algorithm.)
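To illustrate the point about sample selection, here is a rough, self-contained sketch of the multi-batch idea, with made-up helper names rather than this package’s API: the curvature pair (s_k, y_k) is formed from gradients evaluated only on the overlap O_k between consecutive batches, while the two-loop recursion is a standalone routine that could equally be reused elsewhere (e.g. inside an adaptive-sampling test).

```python
# Rough illustration only; hypothetical helpers, not this package's API.
import torch

torch.manual_seed(0)

# Toy least-squares problem: minimize mean((A w - b)^2) over mini-batches of rows.
A = torch.randn(1024, 10)
b = A @ torch.randn(10) + 0.01 * torch.randn(1024)
w = torch.zeros(10, requires_grad=True)

def batch_grad(idx):
    """Gradient of the mini-batch loss at the current w."""
    loss = ((A[idx] @ w - b[idx]) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

def two_loop(grad, s_hist, y_hist):
    """Standalone L-BFGS two-loop recursion; returns the search direction -H_k g."""
    q, alphas = grad.clone(), []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / y.dot(s)
        alpha = rho * s.dot(q)
        q -= alpha * y
        alphas.append((rho, alpha))
    if y_hist:  # initial scaling H_0 = (s^T y / y^T y) * I
        q *= s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1])
    for (rho, alpha), (s, y) in zip(reversed(alphas), zip(s_hist, y_hist)):
        q += (alpha - rho * y.dot(q)) * s
    return -q

batch_size, n_overlap, lr, m = 128, 32, 0.2, 10
s_hist, y_hist, carry = [], [], None

for k in range(50):
    # Multi-batch sampling: S_k shares `carry` (the previous overlap) with S_{k-1}.
    fresh = torch.randperm(1024)[: batch_size - (0 if carry is None else n_overlap)]
    S_k = fresh if carry is None else torch.cat([carry, fresh])
    O_k = S_k[torch.randperm(S_k.numel())[:n_overlap]]  # overlap with the *next* batch

    d = two_loop(batch_grad(S_k), s_hist, y_hist)  # direction from the full batch
    g_overlap_old = batch_grad(O_k)                # overlap gradient at w_k
    with torch.no_grad():
        w += lr * d
    s = lr * d
    y = batch_grad(O_k) - g_overlap_old            # gradient difference on O_k only

    if y.dot(s) > 1e-10:                           # skip pairs that violate curvature
        s_hist.append(s); y_hist.append(y)
        if len(s_hist) > m:
            s_hist.pop(0); y_hist.pop(0)
    carry = O_k
```

Keeping the two-loop recursion separate from the sampling and the curvature update is what makes it easy to swap in full-overlap sampling or a progressive batching test without touching the direction computation.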
Could you please speak to the distinction (if any) between your module and the native PyTorch LBFGS module? Does your module improve upon the PyTorch implementation, and if so, how? Thanks.