Adaptive optimizer vs SGD (need for speed)

Adaptive optimizers can produce better models than SGD, but they also take more time and resources.

The challenge: I have a huge amount of training data, Adagrad takes 4x longer than SGD on it, and I want to reduce the training time while still ending up with a good model.
A few options I’m considering:

  1. Just sample x% of the huge dataset for training; Adagrad can still produce a good model with less training time.
  2. Use 100% of the dataset, but get more GPUs for training.
  3. Use 100% of the dataset, but try some other idea. Any suggestions?

Hope someone could share interesting ideas, thanks.

How about a hybrid approach where you go back and forth between an adaptive optimizer and SGD: use the adaptive optimizer to discover some locally good learning rates, then use SGD to optimize at those rates. To illustrate a bit more, it would be something like this (where T is some generic time unit):

  • adaptive learn for T
  • SGD for 10 * T, using the most recent learning rates from the previous step
  • adaptive learn for T
  • SGD for 10 * T, using the most recent learning rates from the previous step
  • … keep doing this …

This way you spend almost all of your compute time in SGD mode but you use adaptive learning to guide you. The swimming analogy would be that you don’t swim with your head out of the water the whole way, you just pick it up from time to time to ensure you’re headed in the right direction.
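To make the alternation concrete, here’s a toy pure-Python sketch of the idea (the two-coordinate quadratic objective, the curvatures, the base learning rate, and the step counts are all made-up values, and Adagrad stands in for the adaptive optimizer): run Adagrad for T steps, freeze its current effective per-coordinate step sizes, then let plain SGD reuse them for 10*T steps.

```python
import math

# Toy objective: f(w) = 0.5 * sum(a_i * w_i**2), so grad_i = a_i * w_i.
# The curvatures differ per coordinate, which is where adaptivity helps.
A = [100.0, 1.0]          # per-coordinate curvature (assumed toy values)
w = [1.0, 1.0]            # parameters
G = [0.0, 0.0]            # Adagrad's accumulated squared gradients
base_lr, eps = 0.5, 1e-8
T = 5                     # length of each adaptive phase

def grad(w):
    return [a * x for a, x in zip(A, w)]

for round_ in range(3):
    # Phase 1: Adagrad for T steps, updating the accumulators as usual.
    for _ in range(T):
        g = grad(w)
        for i in range(len(w)):
            G[i] += g[i] ** 2
            w[i] -= base_lr / (math.sqrt(G[i]) + eps) * g[i]
    # Freeze Adagrad's current effective per-coordinate step sizes...
    frozen_lr = [base_lr / (math.sqrt(Gi) + eps) for Gi in G]
    # ...and run plain SGD with them for 10*T steps (no accumulator updates).
    for _ in range(10 * T):
        g = grad(w)
        for i in range(len(w)):
            w[i] -= frozen_lr[i] * g[i]

loss = 0.5 * sum(a * x * x for a, x in zip(A, w))
print(loss)
```

The point is that almost all the steps are the cheap SGD kind; the short adaptive phases only refresh the frozen per-coordinate rates.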

Hey Andrei, this back-and-forth approach sounds interesting, but I’m not sure it can work: after T of adaptive learning, how could its latest learning rate be used for SGD for 10*T?
My thought is that the learning rate from adaptive learning is actually a per-feature learning-rate vector, so how would SGD borrow it?

SGD supports per-parameter learning rates; you just have to pass them individually when you instantiate the optimizer.
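For illustration, here’s a minimal sketch of the param-group idea being described (a hand-rolled stand-in, not any real framework’s optimizer API; all names and values are made up): each group of parameters is registered with its own fixed learning rate when the optimizer is constructed.

```python
# Minimal stand-in for an SGD optimizer with per-group learning rates.
class GroupedSGD:
    def __init__(self, param_groups):
        # param_groups: list of {"params": [rows...], "lr": float}
        self.param_groups = param_groups

    def step(self, grads):
        # grads mirrors the nesting of param_groups
        for group, group_grads in zip(self.param_groups, grads):
            lr = group["lr"]
            for p, g in zip(group["params"], group_grads):
                for i in range(len(p)):
                    p[i] -= lr * g[i]   # plain SGD update, group-specific lr

# Hypothetical model with two "layers": an embedding-like table and a head.
embedding = [[1.0, 1.0], [2.0, 2.0]]
head = [[0.5, 0.5]]
opt = GroupedSGD([
    {"params": embedding, "lr": 0.1},   # one rate for the whole layer
    {"params": head, "lr": 0.01},       # a different rate for this layer
])
# One step with all-ones gradients for simplicity.
opt.step([[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0]]])
```

Note that each group shares one scalar rate, so the granularity here is per layer (or per tensor), not per element.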

Here’s another relevant thread on tips for doing this in a convenient way.

Thanks for the per-param links.
However, that doesn’t seem to be the trick I’m looking for. For example, I have a big Embedding layer in the model, where each row is the embedding for one word in the vocabulary. Adaptive optimizers can apply a different effective learning rate to each dimension of each embedding vector, but I think the SGD per-param option just assigns the same learning rate to the whole Embedding layer’s parameters.

My understanding is that Adam differs from SGD in that it makes its step size (the multiplier of the batch gradient) a parameter-dependent, stateful quantity: roughly the recent mean of the gradient divided by the square root of its recent second moment, with bias corrections for both. So ultimately SGD and Adam compute the same batch gradients, but SGD’s step size (as a fraction of the gradient) is the same across all parameters, while Adam’s is allowed to vary across parameters, specifically according to the ratio mean(gradient) / sqrt(second_moment(gradient)). So I don’t see any reason why you couldn’t run Adam for a while, take its per-parameter learning rates, and apply them to an SGD optimizer for subsequent steps.
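To sketch what "take its per-parameter learning rates" could mean concretely (toy setup, all values assumed; the transferred quantity is lr / (sqrt(v_hat) + eps), Adam’s per-element multiplier of the smoothed gradient):

```python
import math

# Toy objective: f(w) = 0.5 * sum(a_i * w_i**2), grad_i = a_i * w_i,
# with very different curvature per coordinate.
A = [100.0, 1.0]
w = [1.0, 1.0]
m = [0.0, 0.0]                  # Adam first moment (mean of gradients)
v = [0.0, 0.0]                  # Adam second moment (mean of squared gradients)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
adam_steps, sgd_steps = 50, 500

def grad(w):
    return [a * x for a, x in zip(A, w)]

# Phase 1: standard Adam updates.
for t in range(1, adam_steps + 1):
    g = grad(w)
    for i in range(len(w)):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)          # bias-corrected mean
        v_hat = v[i] / (1 - b2 ** t)          # bias-corrected second moment
        w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Phase 2: freeze Adam's per-element effective rates and hand them to
# plain SGD (no more moment updates).
frozen_lr = [lr / (math.sqrt(vi / (1 - b2 ** adam_steps)) + eps) for vi in v]
for _ in range(sgd_steps):
    g = grad(w)
    for i in range(len(w)):
        w[i] -= frozen_lr[i] * g[i]
```

Note the momentum part (m_hat) is dropped once we switch to plain SGD; only the per-element scale survives the handoff.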

If you think I’m wrong, I’d love to understand why! I looked for a few minutes and couldn’t find anyone else recommending this back-and-forth between Adam and SGD, which might mean it’s doomed to fail, or it might just mean that most folks don’t run into the kind of compute constraint you’re hitting.