I am really confused about choosing the best learning rate and weight decay. I conducted an ablation study and got the following results:
Both experiments used weight decay = 0.001.
For the first plot: lr = 0.003 => best validation loss: 0.405, reached in epoch 4; train_loss = 0.326 at the same epoch
For the second plot: lr = 0.03 => best validation loss: 0.403, reached in epoch 9; train_loss = 0.411 at the same epoch
I would suggest using a hyperparameter search to find the best values. Try Optuna, Ray Tune, etc. for tuning. Refer to this question.
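To make the idea of a hyperparameter search concrete, here is a minimal sketch of a plain random search over lr and weight decay. It is not Optuna or Ray Tune code; those tools automate the same loop with smarter sampling and pruning. The function `toy_validation_loss` is a hypothetical stand-in for "train the model and return the best validation loss", and the search ranges are assumptions chosen only for illustration.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_validation_loss(lr, weight_decay):
    """Hypothetical stand-in for 'train the model, return best validation loss'.

    It is a smooth bowl with its minimum at lr=0.01, weight_decay=0.001,
    chosen only so the example runs; replace it with a real training run.
    """
    return (math.log10(lr) + 2.0) ** 2 + (math.log10(weight_decay) + 3.0) ** 2


def random_search(objective, n_trials=50, seed=0):
    """Try n_trials random (lr, weight_decay) pairs and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": sample_log_uniform(rng, 1e-4, 1e-1),
            "weight_decay": sample_log_uniform(rng, 1e-5, 1e-2),
        }
        loss = objective(**params)
        if best is None or loss < best["loss"]:
            best = {"loss": loss, **params}
    return best


best = random_search(toy_validation_loss)
print(best)
```

Both parameters are sampled on a log scale because useful values of lr and weight decay usually span several orders of magnitude.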
Thanks, Sayed, for your quick answer, but I would like an analysis of the plots shared above: which one performed better?
Understood. In my opinion, with a smaller lr you don’t risk “overshooting” during gradient descent, but it may take more time to converge. A higher lr can make training faster, but it can also “oscillate” on the way to convergence.
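The overshoot and oscillation behaviour can be seen on a toy problem. The sketch below runs gradient descent on f(x) = x**2, which is an assumption made purely for illustration and has nothing to do with the actual model; the three lr values are also picked only to show the three regimes.

```python
def gradient_descent(lr, x0=1.0, steps=10):
    """Minimise f(x) = x**2 (gradient 2*x) and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * 2 * xs[-1])
    return xs


small = gradient_descent(lr=0.05)     # each step multiplies x by 0.9
large = gradient_descent(lr=0.9)      # each step multiplies x by -0.8
too_large = gradient_descent(lr=1.1)  # each step multiplies x by -1.2

for name, xs in [("small", small), ("large", large), ("too_large", too_large)]:
    print(name, [round(x, 3) for x in xs[:5]], "... final |x| =", round(abs(xs[-1]), 3))
```

With the small lr the iterate approaches the minimum from one side without ever crossing it. With the large lr it jumps across the minimum on every step (the sign flips) yet still shrinks, and with the too-large lr every jump lands further away than the last.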
In the case of lr = 0.003, the validation loss seems stable, with no significant change. I am not sure if it will improve further from there. In the second case, where lr = 0.03, the validation loss rises and then drops significantly. It may indicate a slight tendency to overfit.
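For a quick side-by-side of the two runs, the numbers reported in the question can be put into a small script. This is only arithmetic on the four reported losses, not a verdict; the "gap" (validation loss minus train loss at the best epoch) is one rough overfitting indicator among several.

```python
# Values copied from the question; nothing here is measured anew.
runs = [
    {"lr": 0.003, "best_epoch": 4, "val_loss": 0.405, "train_loss": 0.326},
    {"lr": 0.03, "best_epoch": 9, "val_loss": 0.403, "train_loss": 0.411},
]

for run in runs:
    # Positive gap: the model fits the training data better than the validation data.
    run["gap"] = round(run["val_loss"] - run["train_loss"], 3)
    print(f"lr={run['lr']}: val={run['val_loss']}, gap={run['gap']:+.3f}, "
          f"best epoch={run['best_epoch']}")

val_difference = round(abs(runs[0]["val_loss"] - runs[1]["val_loss"]), 3)
print(f"difference in best validation loss: {val_difference}")
```

The best validation losses differ by only 0.002, which may well be within run-to-run noise, so repeating each run with a few different seeds would be needed before calling one lr better. The gap is clearly different, though: at its best epoch the lr = 0.003 run already fits the training data noticeably better than the validation data, while the lr = 0.03 run does not.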
I would recommend training for some more epochs for both cases to see if the validation value improves. Also, are you performing weight decay based on epoch or steps?
I trained for some more epochs; the model started overfitting and the loss values went up in both cases. The weight decay is applied per epoch.
This is what I got when I trained for more epochs with lr = 0.003
Can you try a linearly decreasing weight decay instead of a fixed value, i.e. 1st epoch: 0.001, 2nd epoch: 0.0008, 3rd epoch: 0.0004, etc.? My guess is that for lr = 0.003, a constant weight decay of 0.001 is reducing lr faster than with lr = 0.03.
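A per-epoch weight decay schedule could be sketched like this. The thread does not say which framework or optimizer is in use, so the code stays framework-free: `linear_weight_decay` and `sgd_step` are hypothetical helper names, and plain SGD with L2-style weight decay is an assumption. The schedule here is strictly linear from 0.001 down to 0, so its values differ slightly from the ones listed above.

```python
def linear_weight_decay(epoch, total_epochs, start=1e-3, end=0.0):
    """Weight decay for a 0-based epoch, falling linearly from start to end."""
    if total_epochs < 2:
        return start
    fraction = min(epoch, total_epochs - 1) / (total_epochs - 1)
    return start + (end - start) * fraction


def sgd_step(weight, grad, lr, weight_decay):
    """One plain SGD update with L2-style weight decay added to the gradient."""
    return weight - lr * (grad + weight_decay * weight)


schedule = [linear_weight_decay(e, total_epochs=6) for e in range(6)]
print([round(wd, 6) for wd in schedule])

# With a zero gradient, the only change to the weight comes from weight decay:
# it shrinks by lr * weight_decay * weight per step.
print(sgd_step(weight=1.0, grad=0.0, lr=0.003, weight_decay=0.001))
print(sgd_step(weight=1.0, grad=0.0, lr=0.03, weight_decay=0.001))
```

Under this formulation, weight decay pulls the weights toward zero and that pull is scaled by lr; it does not change lr itself. So with the same weight decay of 0.001, the per-step shrinkage is ten times smaller at lr = 0.003 than at lr = 0.03. Reducing lr over time would need a separate learning-rate schedule.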
Yes, the model is diverging (potentially overfitting) when you train for more epochs.
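If the validation loss keeps rising with more epochs, one common option is early stopping: keep the checkpoint from the best validation epoch and stop once it has not improved for a while. Below is a minimal sketch of that rule; the function name, the `patience` value, and the loss list are all made up for illustration and are not the values from the plots.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss, stop_epoch) using 1-based epochs.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs, or when the list runs out.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(val_losses)


# Made-up validation losses, for illustration only.
example_losses = [0.52, 0.46, 0.43, 0.41, 0.42, 0.44, 0.47, 0.50]
print(early_stopping_epoch(example_losses, patience=3))  # (4, 0.41, 7)
```

In the example the best epoch is 4 and training would stop at epoch 7, after three epochs without improvement. The model saved at epoch 4 is the one to evaluate.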