How can I apply a layer-wise learning rate scheduler, e.g., different warmup epochs for different layers in a cosine decay scheduler with linear warmup?
I've tried a lot of things but have no clue.
The schedulers will apply their learning rate update to all parameter groups defined in the optimizer, if I’m not mistaken, so I would assume creating separate optimizers with their corresponding schedulers might be a valid approach. @albanD might correct me if I’m missing a more flexible approach for schedulers.
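Something along these lines might work as a starting point. This is only a minimal sketch of the "one optimizer + scheduler per layer" idea, assuming a toy two-layer `nn.Sequential` model and `LambdaLR` with a hand-written warmup/cosine factor; the warmup lengths, learning rates, and epoch count are made up:

```python
import math
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Toy two-layer model just for illustration
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
total_epochs = 100

def warmup_cosine(warmup_epochs, total_epochs):
    # Returns a multiplicative factor for LambdaLR:
    # linear warmup for `warmup_epochs`, then cosine decay towards 0.
    def fn(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return fn

# One optimizer/scheduler pair per layer, each with its own warmup length
opt0 = torch.optim.SGD(model[0].parameters(), lr=0.1)
opt1 = torch.optim.SGD(model[1].parameters(), lr=0.1)
sched0 = LambdaLR(opt0, lr_lambda=warmup_cosine(warmup_epochs=5, total_epochs=total_epochs))
sched1 = LambdaLR(opt1, lr_lambda=warmup_cosine(warmup_epochs=20, total_epochs=total_epochs))

for epoch in range(total_epochs):
    # ... training loop: backward, then opt0.step() and opt1.step() ...
    sched0.step()
    sched1.step()
```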
Yes, that works, and I had actually thought about this idea before, but it seems much more complicated than just setting layer-wise learning rates in the optimizer.
It feels overcomplicated since I'd have to create multiple optimizers and multiple schedulers. Is there any way to simplify this?
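One possible simplification: `LambdaLR` also accepts a list of lambdas, one per parameter group, so you might be able to keep a single optimizer with one parameter group per layer and give each group its own warmup/cosine factor. A minimal sketch, reusing the same toy model and `warmup_cosine` helper as above (all values made up):

```python
import math
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
total_epochs = 100

def warmup_cosine(warmup_epochs, total_epochs):
    # Linear warmup followed by cosine decay, as a multiplicative factor
    def fn(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return fn

# Single optimizer with one parameter group per layer
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 0.1},
    {"params": model[1].parameters(), "lr": 0.05},
])

# One lambda per param group -> different warmup epochs per layer
scheduler = LambdaLR(optimizer, lr_lambda=[
    warmup_cosine(5, total_epochs),
    warmup_cosine(20, total_epochs),
])

for epoch in range(total_epochs):
    # ... forward / backward / optimizer.step() ...
    scheduler.step()
```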