About learning rate and weight decay "fine-tuning"

gass_jb · September 1, 2021, 11:37am

I am really confused about choosing the best learning rate and weight decay. I conducted an ablation study and got the following results:

Both experiments used a weight decay of 0.001.
First plot: lr = 0.003, best validation loss 0.405 reached at epoch 4, with train_loss = 0.326 at the same epoch.
Second plot: lr = 0.03, best validation loss 0.403 reached at epoch 9, with train_loss = 0.411 at the same epoch.

Sayed_Nadim · September 1, 2021, 12:10pm

I would suggest using a hyperparameter search to find the best values. Try Optuna, Ray Tune, etc. for tuning. Refer to this question.
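
To make the idea of a hyperparameter search concrete, here is a minimal sketch of a plain random search over lr and weight decay, written with only the standard library. It is not Optuna or Ray Tune code; those libraries automate this loop and use smarter sampling. The search ranges, `random_search`, and `toy_objective` are all hypothetical; in practice the objective would train the model and return its best validation loss.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, n_trials=20, seed=0):
    """Evaluate n_trials random (lr, weight_decay) pairs and return the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            # Assumed search ranges, chosen only for illustration.
            "lr": sample_log_uniform(rng, 1e-4, 1e-1),
            "weight_decay": sample_log_uniform(rng, 1e-5, 1e-2),
        }
        val_loss = objective(params)
        if best is None or val_loss < best[0]:
            best = (val_loss, params)
    return best


def toy_objective(params):
    # Hypothetical stand-in for "train the model, return best validation loss".
    # It is a simple bowl in log-space so the script runs without a real model.
    return (math.log10(params["lr"]) + 2) ** 2 + (
        math.log10(params["weight_decay"]) + 3
    ) ** 2


best_loss, best_params = random_search(toy_objective, n_trials=50)
print(best_loss, best_params)
```

Sampling on a log scale is a common choice here because useful values of lr and weight decay tend to span several orders of magnitude.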

gass_jb · September 1, 2021, 12:15pm

Thanks, Sayed, for your quick answer, but I would like an analysis of the plots shared above: which run performed better?

Sayed_Nadim · September 1, 2021, 12:29pm

Understood. In my opinion, with a smaller lr you don’t risk “overshooting” during gradient descent, but it may take more time to converge. A higher lr can make training faster, but it can also “oscillate” around the minimum instead of converging.
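
The overshoot and oscillation behaviour can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, which is an assumption chosen for simplicity and not the model from this thread; the learning rates are illustrative and are not comparable to 0.003 or 0.03.

```python
def gradient_descent(lr, steps=10, x0=1.0):
    """Minimise f(x) = x**2 (gradient 2*x) and return every visited point."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * 2 * xs[-1])
    return xs


for lr in (0.01, 0.4, 0.9, 1.1):
    path = gradient_descent(lr)
    # A sign change between consecutive points means the step jumped
    # over the minimum at x = 0.
    overshoots = any(a * b < 0 for a, b in zip(path, path[1:]))
    print(f"lr={lr}: final |x| = {abs(path[-1]):.4f}, overshoots = {overshoots}")
```

On this function each step multiplies x by (1 - 2 * lr), so a small lr shrinks x slowly, a moderate lr shrinks it quickly, an lr above 0.5 flips the sign on every step, and an lr above 1.0 makes the iterates grow.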
In the case of lr = 0.003, the validation loss seems stable, with no significant change. I am not sure if it will improve further from there. In the second case, where lr = 0.03, the validation loss rises and then drops significantly. That may indicate a slight tendency to overfit.
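
One simple way to put the two runs side by side is to look at the gap between validation and training loss at the best epoch. The sketch below uses only the numbers reported at the top of the thread; the helper name `generalization_gap` is made up for illustration, and a single-epoch gap is a rough indicator, not a full diagnosis.

```python
# Values copied from the two experiments reported in the first post.
runs = [
    {"lr": 0.003, "best_epoch": 4, "val_loss": 0.405, "train_loss": 0.326},
    {"lr": 0.03, "best_epoch": 9, "val_loss": 0.403, "train_loss": 0.411},
]


def generalization_gap(run):
    """Validation loss minus training loss at the best epoch."""
    return run["val_loss"] - run["train_loss"]


for run in runs:
    print(
        f"lr={run['lr']}: best epoch {run['best_epoch']}, "
        f"val={run['val_loss']:.3f}, gap={generalization_gap(run):+.3f}"
    )

# Pick the run with the lowest validation loss.
best = min(runs, key=lambda r: r["val_loss"])
print("lowest validation loss at lr =", best["lr"])
```

The two best validation losses differ by only 0.002, which may well be within run-to-run noise, so repeating each setting with a few seeds would give a firmer comparison.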

I would recommend training for some more epochs in both cases to see if the validation loss improves. Also, are you applying weight decay per epoch or per step?

gass_jb · September 1, 2021, 12:37pm

I trained for some more epochs; the model starts overfitting and the loss values went higher in both cases. The weight decay is applied per epoch.

gass_jb · September 1, 2021, 12:44pm

This is what I got when I trained for more epochs with lr = 0.003:

[Plot: lr = 0.003, weight decay = 0.001, 8 epochs]

Sayed_Nadim · September 1, 2021, 12:47pm

Can you try a decaying weight decay schedule instead of a fixed value, e.g. 1st epoch: 0.001, 2nd epoch: 0.0008, 3rd epoch: 0.0004, etc.? I guess that for lr = 0.003, a constant weight decay of 0.001 is reducing lr faster than it does with lr = 0.03.
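
A per-epoch weight decay schedule of this kind could be sketched as below. This is a strictly linear variant, so its values only match the example above for the first two epochs; the function name, the end value of 0.0002, and the total of 5 epochs are assumptions made for illustration.

```python
def linear_weight_decay(epoch, start=1e-3, end=2e-4, total_epochs=5):
    """Weight decay for a 0-indexed epoch, moving linearly from start to end."""
    if total_epochs < 2:
        return start
    # Clamp so that epochs past the schedule keep the final value.
    frac = min(epoch, total_epochs - 1) / (total_epochs - 1)
    return start + (end - start) * frac


schedule = [linear_weight_decay(epoch) for epoch in range(5)]
print([round(wd, 6) for wd in schedule])
```

In PyTorch, the returned value could be applied at the start of each epoch by assigning it to `group["weight_decay"]` for every `group` in `optimizer.param_groups`.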

Sayed_Nadim · September 1, 2021, 12:48pm

Yes, the model is diverging (potentially overfitting) when you train for more epochs.