Preference of learning rate scheduler

When do we prefer which learning rate scheduler?
cosine annealing scheduler
exponential scheduler
Is it trial and error only, or can we prefer one over the other based on the situation?
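For reference, the two schedules mentioned above differ mainly in their shape over time. A minimal sketch in plain Python (mirroring the usual formulas, e.g. as in PyTorch's `CosineAnnealingLR` and `ExponentialLR`; function names here are my own):

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing: smooth decay from base_lr down to min_lr over total_steps.
    Stays near base_lr early and near min_lr late."""
    cos_factor = (1 + math.cos(math.pi * step / total_steps)) / 2
    return min_lr + (base_lr - min_lr) * cos_factor

def exponential_lr(step, base_lr, gamma=0.95):
    """Exponential decay: multiply by gamma every step.
    Decays fastest at the start and never quite reaches zero."""
    return base_lr * gamma ** step

# Compare the shapes over a 100-step run:
for step in (0, 25, 50, 75, 100):
    print(step,
          round(cosine_annealing_lr(step, 100, 0.1), 4),
          round(exponential_lr(step, 0.1), 4))
```

So with cosine annealing you effectively train longer at a high learning rate and finish with a gentle landing, whereas exponential decay cuts the rate aggressively right away, which can slow early progress if gamma is too small.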

Probably trial and error + intuition. :slight_smile:

What I often see is either a fixed learning rate throughout, or a learning rate that is decreased at 2-3 discrete points during training. And then there is the 1-cycle policy advocated by L. Smith in his papers, which has seen lots of success (see below).
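Both of those patterns can be sketched in a few lines of plain Python. This is a simplified illustration, not anyone's exact recipe; in particular, the 1-cycle sketch covers only the learning rate leg and uses linear ramps, ignoring the momentum schedule from Smith's paper:

```python
def multistep_lr(step, base_lr, milestones=(30, 60), gamma=0.1):
    """Step decay: drop the learning rate by a factor of gamma
    each time training passes a milestone step."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops

def one_cycle_lr(step, total_steps, max_lr, start_div=25.0):
    """1-cycle (simplified): ramp linearly from a small initial rate
    up to max_lr over the first half of training, then anneal
    linearly back down over the second half."""
    initial_lr = max_lr / start_div
    half = total_steps / 2
    if step <= half:
        frac = step / half                # warmup phase
        return initial_lr + (max_lr - initial_lr) * frac
    frac = (step - half) / half           # annealing phase
    return max_lr - (max_lr - initial_lr) * frac

# Example: classic step decay vs. the 1-cycle shape over 100 steps
for step in (0, 30, 50, 60, 100):
    print(step, multistep_lr(step, 0.1), round(one_cycle_lr(step, 100, 1.0), 4))
```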

Also, it depends a bit on the domain. Training purely convolutional networks in vision usually works differently than training LSTMs in NLP.
For the common architectures, there are extensive hyperparameter studies and recipes. In no particular order:

  • David Page has a series on training resnets that provides lots of insights. There are also various “bag of tricks” papers that collect such principles.
  • For NLP, S. Merity et al.’s paper on AWD-LSTM has been very influential in presenting a good training approach.
  • @jphoward discusses L. Smith’s 1-cycle policy in his courses, and lots of people following the courses wrote blog posts about their successes with it. (I think he also has material on AWD-LSTM for NLP, but I forget.)
  • For transformer language models (BERT/GPT and co), Hugging Face has some info on training transformers in general in the documentation for their great library. There are some papers emerging about efficient training, but I think the effort to bring down the total GPU resources used is still underway.
  • In general, I’d look for well-done papers with good results and code in your area of interest to see what they’re doing. Not all papers take great care in squeezing out the last bit of performance (because it’s not their focus), but usually they have experimented enough to find a method that generally works well.

Best regards