SM3 for PyTorch (Memory-efficient Adaptive Optimizer)

Hello all,

I recently learned about the SM3 optimization algorithm. I implemented it in PyTorch to better understand the paper, and I am sharing the code at PyTorch-SM3. The implementation includes the features of the authors' TensorFlow version (support for dense and sparse tensors) as well as a feature they mention but do not include in their final version (exponential moving averages). The authors designed the algorithm to be a memory-efficient alternative to Adam and Adagrad for large NLP architectures such as Transformer-Big and BERT.

I have tested my version numerically and tried it on a few simple networks to verify that no issues arose during training. However, I am not working with models as large as the optimizer is intended for. If you do work with models that large and notice any issues, please let me know.
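For anyone who wants to try it, here is a minimal usage sketch. The import path, class name, and learning rate below are assumptions on my part; check the repository README for the exact names and defaults.

```python
# Minimal usage sketch. The import path and the lr value are assumptions;
# the optimizer is meant to follow the standard torch.optim.Optimizer API.
import torch
import torch.nn as nn
from SM3 import SM3  # assumed module/class name; adjust to the repo layout

model = nn.Linear(512, 512)
optimizer = SM3(model.parameters(), lr=0.1)  # hypothetical learning rate
loss_fn = nn.MSELoss()

x = torch.randn(64, 512)
target = torch.randn(64, 512)

for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
```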


Looks nice! I’m curious how performance compares against the AdaFactor optimizer – https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py . AdaFactor was used to train the 1.5B GPT-2 model. SM3 seems to give extra flexibility in choosing factors, whereas AdaFactor is limited to an output-dimension/input-dimension split.
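For a rough sense of the memory difference on a single (n, m) weight matrix, counting second-moment statistics only (Adam also keeps a first moment), here is a back-of-the-envelope comparison with arbitrary example sizes:

```python
# Back-of-the-envelope optimizer-state sizes for one (n, m) weight matrix,
# counting second-moment statistics only. Sizes are arbitrary examples.
n, m = 1024, 4096

adam_state = n * m        # full per-parameter second moment
adafactor_state = n + m   # factored second moment: row and column statistics
sm3_state = n + m         # row and column accumulators for a row/column cover

print(adam_state, adafactor_state, sm3_state)  # 4194304 5120 5120
```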

The authors compare SM3 to Adafactor in their article. I’ll quote the relevant paragraph as I don’t think I can say it better.

Comparison with Adafactor: Adafactor is a very effective method for space-efficient adaptive optimization. SM3 and Adafactor differ in a number of important ways. First, Adafactor is only defined for matrix-shaped parameters while SM3 applies to tensors of arbitrary dimensions, and even more generally, to any predefined cover of the parameters. Second, Adafactor is in essence a fixed learning-rate algorithm, being a memory-constrained variation of Adam, and often requires a manually devised learning-rate schedule to ensure convergence. In contrast, SM3 adapts its learning rates in an adaptive, data-driven manner similar to Adagrad. Finally, SM3 comes with rigorous convergence guarantees in stochastic convex optimization settings.

And yeah, SM3 offers a lot of flexibility in the choice of cover. Despite this, they end up choosing a row/column cover similar to Adafactor’s split, as their experiments showed that activations were correlated along rows and columns.
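To illustrate the row/column cover on an (n, m) matrix: the accumulator for entry (i, j) is taken as the minimum of a row accumulator and a column accumulator, and the stored accumulators are updated with the row/column maxima, so only n + m values are kept. This is just an illustrative sketch of the idea from the paper, not code from my implementation.

```python
# Illustrative sketch of SM3's row/column cover on an (n, m) matrix parameter;
# not taken from the linked implementation. Only n + m accumulators are stored
# instead of the n * m entries Adagrad would keep.
import torch

n, m = 4, 6
grad = torch.randn(n, m)          # gradient for one step

row_acc = torch.zeros(n)          # one accumulator per row
col_acc = torch.zeros(m)          # one accumulator per column

# Effective accumulator for entry (i, j): min(row_acc[i], col_acc[j]) + g_ij^2.
effective = torch.minimum(row_acc.unsqueeze(1), col_acc.unsqueeze(0)) + grad ** 2

# The stored accumulators keep the max over their row / column.
row_acc = effective.max(dim=1).values
col_acc = effective.max(dim=0).values

# The Adagrad-style update then scales the gradient by 1/sqrt(effective).
update = grad / (effective.sqrt() + 1e-8)
print(row_acc.shape, col_acc.shape, update.shape)
```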