Is there a major difference between the Adam and SparseAdam implementations?
I'm using SparseAdam to optimize the embedding layer in my model, and I noticed that the model needs fewer epochs to converge if I instead optimize the embedding layer with Adam, with sparse gradients disabled.
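For concreteness, here is a minimal sketch of the two setups I'm comparing. The model, sizes, and learning rates are all hypothetical placeholders, not my actual code:

```python
import torch
import torch.nn as nn

# Setup A: embedding with sparse gradients, optimized by SparseAdam;
# the dense head is handled by a separate Adam instance.
emb = nn.Embedding(1000, 64, sparse=True)
head = nn.Linear(64, 2)
opt_sparse = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)
opt_dense = torch.optim.Adam(head.parameters(), lr=1e-3)

# Setup B: sparse gradients disabled, everything under a single Adam.
emb_dense = nn.Embedding(1000, 64, sparse=False)
head_dense = nn.Linear(64, 2)
opt_all = torch.optim.Adam(
    list(emb_dense.parameters()) + list(head_dense.parameters()), lr=1e-3
)

# One training step for setup A: the embedding's gradient is a sparse tensor.
x = torch.randint(0, 1000, (8,))
loss = head(emb(x).mean(dim=0, keepdim=True)).sum()
loss.backward()
opt_sparse.step()
opt_dense.step()
```

With setup B, the only change in the training loop is calling `opt_all.step()` instead of the two separate `step()` calls.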