Any COCOB optimizer implementation in PyTorch?

This new fancy optimizer requires no learning rate at all and seems to perform quite well compared to previous optimizers.
See the paper for details.
Paper: “Backprop without Learning Rates Through Coin Betting”, Orabona & Tommasi 2017
The original implementation:

I wouldn’t spend my time on anything that only shows MNIST results.. I wouldn’t spend my time on anything that only shows results on small datasets. It’s really hard to say whether it generalizes to anything in practice or not.

For anyone else visiting this thread, this optimizer is equivalent while reading the code intuitively, I see it roughly doing something like: running several parallel optimizers on the model, each with a different learning rate, and then averaging the weights out.

Edited because the comment oversimplified and incorrectly represented the original paper. See comments below.


This time-series modeling winner used COCOB for RNN training. Can you show more detail on how to average the outputs of several optimizers in PyTorch?

One more update here, looks like a clear implementation of this in PyTorch already.

Found this old thread for a random chance.

I am the author of that paper, I am a bit surprised by your comments:

  • the experiments are not on MNIST only
  • the optimizer is obviously not equivalent to running different learning rate and averaging

Did you actually read the paper?

It’s been a year, and I honestly dont remember reading the paper. I read through the TF implementation that was linked in the post, and I incorrectly assumed the code was for results in the paper. Had I spent a few more minutes, I would’ve know that it was done on MNIST, CIFAR10 and Penn Tree Bank. I apologize for being “that” person.
That being said, by MNIST, I meant “experiments done on small datasets”, as I’ve been burned multiple times on picking up an MNIST / CIFAR10 paper and it not working on large scale. And by “equivalent”, that was bad wording, I wanted people to get an intuition of the paper, similar to how you described in the README.

Thanks for your reply, I appreciate it.

For future readers, let me just explain a bit better the intuition of the paper:

Imagine running in parallel multiple optimization algorithms, each one with a different learning rate. Now, put on top of them a master algorithm able to decide which one is giving the best performance. This would give rise to a perfect optimization algorithm, at least for convex functions. And, indeed, this is exactly the MetaGrad algorithm from NIPS’16 (
Now, image instead having an infinite number of learners in parallel and the algorithm on top doing an expectation of all of them with respect to the best distribution of learning rates. The infinity makes things easier because now the update becomes in a simple closed form. And this is exactly how my algorithm works.
The coin-betting trick is a just an easy way to design this kind of algorithms.