What is the optimal way to set the learning rate?

Hello all :slight_smile:

I have a question for the experts about setting hyperparameters like the learning rate and dropout for an NMT project (multi-head attention). The training data is large and each training session takes a long time, so it is hard to tune these values experimentally. The total training data is as follows:

  1. Synthetic data for pre-training: 6 GB.
  2. First-step fine-tuning data: 20 MB.
  3. Second-step fine-tuning monolingual data: 200 MB.

I have decided to use a different learning rate for each dataset, e.g. 1e-4, 1e-3, and 1e-1 respectively, but I’m not sure these are the best options.

Is there any statistical method for calculating suitable values for the learning rate and dropout?

The multi-head attention hyperparameters are below:

D_MODEL = 256
N_LAYERS = 4
N_HEADS = 8
HIDDEN_SIZE = 512
MAX_LEN = 400
DROPOUT = 0.15
BATCH_SIZE = 64
LR = 1e-4
N_EPOCHS = 5

The available server has two TITAN RTX GPUs, with about 50 GB of GPU RAM in total.

Kind regards,
Aiman Solyman

It’s not a direct calculation, but you could try out optuna / optuna-github, which is “An open source hyperparameter optimization framework to automate hyperparameter search”.
A simple PyTorch example can be found here.
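
As a rough sketch of what that could look like for these two parameters (the `nn.Sequential` model and random tensors below are only placeholders standing in for the real NMT model and data, and the search ranges are illustrative assumptions):

```python
import optuna
import torch
import torch.nn as nn

# Toy stand-ins for the real training data; replace with your own DataLoaders.
x = torch.randn(512, 256)
y = torch.randn(512, 256)

def objective(trial):
    # Sample the two hyperparameters of interest.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)

    # Placeholder model; in practice this would be the multi-head attention model.
    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Dropout(dropout), nn.Linear(512, 256))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    # A short training run is usually enough to rank trials against each other.
    for _ in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    return loss.item()  # Optuna minimizes this value.

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)  # e.g. {'lr': ..., 'dropout': ...}
```

Each trial trains briefly with one sampled (lr, dropout) pair, and `study.best_params` reports the best combination found.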


Hyperparameter search libraries are a bit overkill for two independent parameters (you only need to perform two line searches).

Also, some hyperparameters can be tuned on a reduced dataset and/or with a reduced epoch count.
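
To make the two-line-search idea concrete, here is a minimal sketch on a reduced dataset (the tiny model and random tensors are placeholders for a small subsample of the real data; run it once over the LR with dropout fixed, then once over dropout with the best LR):

```python
import torch
import torch.nn as nn

# Placeholders for a small subsample of the real training/validation data.
x_train, y_train = torch.randn(256, 256), torch.randn(256, 256)
x_val, y_val = torch.randn(64, 256), torch.randn(64, 256)

def short_run(lr, dropout=0.15, steps=50):
    """Train briefly with one setting and return the validation loss."""
    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Dropout(dropout), nn.Linear(512, 256))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
    model.eval()  # disable dropout for evaluation
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

# Line search over the LR with dropout fixed; repeat over dropout afterwards.
results = {lr: short_run(lr) for lr in (1e-5, 1e-4, 1e-3, 1e-2)}
print(min(results, key=results.get), results)
```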

For the LR it is common to use warmup-and-decay scheduling instead; the early peak LR can be set as high as possible, and the decay parameters are tuned instead (alternatively, the ReduceLROnPlateau scheduler decreases the LR automatically when the monitored metric stops improving).
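
For illustration, a warmup-plus-inverse-square-root decay (the same shape as the schedule in “Attention Is All You Need”) can be written with `torch.optim.lr_scheduler.LambdaLR`; the peak LR and `warmup_steps` below are only example settings, and they are the knobs you would tune instead of a fixed LR:

```python
import torch

# Placeholder model; only the optimizer/scheduler wiring matters here.
model = torch.nn.Linear(256, 256)
peak_lr, warmup_steps = 1e-3, 4000
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.98), eps=1e-9)

def inverse_sqrt(step):
    # Linear warmup to peak_lr over warmup_steps, then decay ~ 1/sqrt(step).
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

for step in range(1000):
    # ... forward / backward / gradient clipping would go here ...
    optimizer.step()
    scheduler.step()  # advance the LR schedule once per optimizer step
```

Alternatively, `torch.optim.lr_scheduler.ReduceLROnPlateau`, stepped with the validation loss, gives the automatic decrease mentioned above.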


@RaLo4 & @googlebot Thank you so much, both comments are very valuable. I’m working on them now :slight_smile: