Adam adapts its learning rate for each parameter as the gradients are updated, so I thought we might not need a learning rate scheduler.
However, I am not sure whether Adam combined with that kind of learning rate scheduler can still jump out of, or move away from, a local minimum.
In transfer_learning_tutorial, momentum SGD is used with a learning rate scheduler.
Yes, I have had that experience.
In my current project, I split num_epochs into three parts:
num_epochs_1: warm-up.
num_epochs_2: Adam, to speed up convergence.
num_epochs_3: momentum SGD + CosScheduler for training.
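A rough sketch of that three-phase split, in case it helps. This is not my actual project code: the toy model, the data, the epoch counts, and the learning rates are made up for illustration. I am also assuming here that the warm-up is a linear ramp of Adam's learning rate (via `LinearLR`) and that "CosScheduler" corresponds to PyTorch's `CosineAnnealingLR`.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # toy model (hypothetical)
x, y = torch.randn(64, 4), torch.randn(64, 1)  # toy data (hypothetical)

# Hypothetical phase lengths; num_epochs is their sum.
num_epochs_1, num_epochs_2, num_epochs_3 = 2, 3, 5


def run_epoch(optimizer):
    """One full-batch training step; returns the loss value."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()


history = []  # (phase, learning rate used, loss)

# Phase 1: warm-up (assumed: Adam with a linear ramp from 10% to 100% of lr).
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
warmup = LinearLR(adam, start_factor=0.1, total_iters=num_epochs_1)
for _ in range(num_epochs_1):
    history.append(("warmup", adam.param_groups[0]["lr"], run_epoch(adam)))
    warmup.step()

# Phase 2: Adam at its full learning rate to speed up convergence.
for _ in range(num_epochs_2):
    history.append(("adam", adam.param_groups[0]["lr"], run_epoch(adam)))

# Phase 3: momentum SGD with cosine annealing for the rest of training.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
cosine = CosineAnnealingLR(sgd, T_max=num_epochs_3)
for _ in range(num_epochs_3):
    history.append(("sgd", sgd.param_groups[0]["lr"], run_epoch(sgd)))
    cosine.step()

for phase, lr, loss in history:
    print(f"{phase:>6}  lr={lr:.6f}  loss={loss:.4f}")
```

Note that switching optimizers discards Adam's moment estimates, so the SGD learning rate for the last phase usually has to be tuned separately.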
My friend used Adam without a learning rate scheduler in his project, and he found that the loss started to rise after some epochs.
You can find some discussion here. Adam can adjust the learning rate adaptively, but that ability is limited.
At least for me, momentum SGD is the most stable optimizer, and Adam/AdamW is a good trick to speed up convergence.
All of this is my personal experience. Is it necessary to use a learning rate scheduler? Maybe, as the answer in the link says:
PyTorch's Adam implementation follows the changes proposed in Decoupled Weight Decay Regularization, which states:
Adam can substantially benefit from a scheduled learning rate multiplier. The fact that Adam
is an adaptive gradient algorithm and as such adapts the learning rate for each parameter
does not rule out the possibility to substantially improve its performance by using a global
learning rate multiplier, scheduled, e.g., by cosine annealing.
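
For what it's worth, attaching such a global schedule to Adam in PyTorch takes only a couple of lines. Below is a minimal sketch using `torch.optim.lr_scheduler.CosineAnnealingLR`; the toy model, the random data, the base learning rate, and the epoch count are placeholders I picked for illustration, not recommendations.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                          # toy model (placeholder)
x, y = torch.randn(32, 4), torch.randn(32, 1)    # toy data (placeholder)

num_epochs = 10                                  # placeholder value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Global multiplier on top of Adam's per-parameter adaptation:
# the base lr is annealed from 1e-3 toward eta_min (default 0) over T_max steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

lrs = []
for epoch in range(num_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # call after optimizer.step()

print([f"{lr:.6f}" for lr in lrs])
```

Adam still rescales each parameter's step using its moment estimates; the scheduler only shrinks the shared base learning rate over time.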