Is there a major difference between the Adam and SparseAdam implementations?
I'm using SparseAdam to optimize the embedding layer in my model, and I noticed that the model needs fewer epochs to converge if I instead optimize the embedding layer with Adam, with sparse gradients disabled.
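For concreteness, here is a minimal sketch of the two setups I'm comparing. The model, sizes, and learning rates are all hypothetical placeholders, not my actual code:

```python
import torch
import torch.nn as nn

# Setup A: embedding with sparse gradients, optimized by SparseAdam;
# the dense head is handled by a separate Adam instance.
emb = nn.Embedding(1000, 64, sparse=True)
head = nn.Linear(64, 2)
opt_sparse = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)
opt_dense = torch.optim.Adam(head.parameters(), lr=1e-3)

# Setup B: sparse gradients disabled, everything under a single Adam.
emb_dense = nn.Embedding(1000, 64, sparse=False)
head_dense = nn.Linear(64, 2)
opt_all = torch.optim.Adam(
    list(emb_dense.parameters()) + list(head_dense.parameters()), lr=1e-3
)

# One training step for setup A: the embedding's gradient is a sparse tensor.
x = torch.randint(0, 1000, (8,))
loss = head(emb(x).mean(dim=0, keepdim=True)).sum()
loss.backward()
opt_sparse.step()
opt_dense.step()
```

With setup B, the only change in the training loop is calling `opt_all.step()` instead of the two separate `step()` calls.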