I tried Googling it and couldn’t find any implementations of that optimizer for PyTorch; most of what exists consists of variations on first-order gradient descent. If your gradients are not stochastic, you might try torch.optim’s implementation of the second-order optimizer L-BFGS (be sure to set line_search_fn='strong_wolfe', or you risk the optimizer ‘blowing up’ by accepting a step that increases the loss).
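Something like this, roughly (a minimal sketch with a toy model and data just for illustration, not tuned for any real problem):

```python
import torch

# Toy model and data, purely for illustration.
model = torch.nn.Linear(10, 1)
x = torch.randn(64, 10)
y = torch.randn(64, 1)

optimizer = torch.optim.LBFGS(
    model.parameters(),
    lr=1.0,
    max_iter=20,
    line_search_fn='strong_wolfe',  # rejects steps that would increase the loss
)

def closure():
    # L-BFGS may re-evaluate the objective several times per step,
    # so the loss/gradient computation has to live in a closure.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

for _ in range(50):
    optimizer.step(closure)
```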
I know this thread is a bit old, but for anyone still looking for a Levenberg-Marquardt implementation in PyTorch, I’ve developed one: torch-levenberg-marquardt. Hope it helps anyone who comes across this!
Hi Robert, I think my implementation doesn’t suffer from the issues mentioned in that discussion. Using jacrev + vmap to compute the Jacobian in PyTorch is fast and memory-efficient, especially compared to other methods I’ve tried in both PyTorch and TensorFlow.
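For anyone curious what that pattern looks like, here is a minimal sketch using torch.func (not the library’s actual code; the tiny linear model and shapes are just placeholders):

```python
import torch
from torch.func import functional_call, jacrev, vmap

model = torch.nn.Linear(4, 2)          # placeholder model for illustration
params = dict(model.named_parameters())
inputs = torch.randn(128, 4)           # a batch of 128 samples

def output_fn(p, x):
    # Model output for a single sample x under parameters p.
    return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

# jacrev differentiates the outputs w.r.t. the parameters;
# vmap maps that per-sample Jacobian over the batch dimension of `inputs`,
# so the full Jacobian is built without a Python loop over samples.
jacobian = vmap(jacrev(output_fn), in_dims=(None, 0))(params, inputs)

# `jacobian` is a dict keyed by parameter name; e.g. the entry for the
# weight has shape (batch, output_dim, *weight.shape) = (128, 2, 2, 4).
print({name: j.shape for name, j in jacobian.items()})
```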
Of course, you cannot expect to train models with billions of parameters using LM, but for certain architectures, I’ve been able to train models with millions of parameters on a GPU.