What is the optimal way to set the learning rate?

Hello all :slight_smile:

I have a question for the experts about setting hyperparameters like the learning rate and dropout for an NMT project (multi-head attention). The training data is large and each training session takes a long time, so it is hard to tune these values experimentally. The total training data is as follows:

  1. Synthetic data for pre-training: 6 GB.
  2. First-step fine-tuning data: 20 MB.
  3. Second-step fine-tuning monolingual data: 200 MB.

I have decided to use a different learning rate for each dataset, e.g. 1e-4, 1e-3, and 1e-1 respectively, but I’m not sure these are the best options.

Is there any statistical method for calculating suitable values for the learning rate and dropout?

The multi-head attention hyperparameters are below:

D_MODEL = 256
N_LAYERS = 4
N_HEADS = 8
HIDDEN_SIZE = 512
MAX_LEN = 400
DROPOUT = 0.15
BATCH_SIZE = 64
LR = 1e-4
N_EPOCHS = 5

The available server has two TITAN RTX GPUs, with about 50 GB of GPU RAM in total.

Kind regards,
Aiman Solyman

It’s not a direct calculation, but you could try out optuna / optuna-github, which is “An open source hyperparameter optimization framework to automate hyperparameter search”.
A simple PyTorch example can be found here.
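
As a rough sketch of what that could look like for these two parameters (the `nn.Sequential` model and random tensors below are only placeholders standing in for the real NMT model and data, and the search ranges are illustrative assumptions):

```python
import optuna
import torch
import torch.nn as nn

# Toy stand-ins for the real training data; replace with your own DataLoaders.
x = torch.randn(512, 256)
y = torch.randn(512, 256)

def objective(trial):
    # Sample the two hyperparameters of interest.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)

    # Placeholder model; in practice this would be the multi-head attention model.
    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Dropout(dropout), nn.Linear(512, 256))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    # A short training run is usually enough to rank trials against each other.
    for _ in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    return loss.item()  # Optuna minimizes this value.

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)  # e.g. {'lr': ..., 'dropout': ...}
```

Each trial trains briefly with one sampled (lr, dropout) pair, and `study.best_params` reports the best combination found.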


Hyperparameter search libraries are a bit overkill for two independent parameters (you only need to perform two line searches).

Also, some hyperparameters can be tuned on a reduced dataset and/or with a reduced epoch count.
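
To make the two-line-search idea concrete, here is a minimal sketch on a reduced dataset (the tiny model and random tensors are placeholders for a small subsample of the real data; run it once over the LR with dropout fixed, then once over dropout with the best LR):

```python
import torch
import torch.nn as nn

# Placeholders for a small subsample of the real training/validation data.
x_train, y_train = torch.randn(256, 256), torch.randn(256, 256)
x_val, y_val = torch.randn(64, 256), torch.randn(64, 256)

def short_run(lr, dropout=0.15, steps=50):
    """Train briefly with one setting and return the validation loss."""
    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Dropout(dropout), nn.Linear(512, 256))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
    model.eval()  # disable dropout for evaluation
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

# Line search over the LR with dropout fixed; repeat over dropout afterwards.
results = {lr: short_run(lr) for lr in (1e-5, 1e-4, 1e-3, 1e-2)}
print(min(results, key=results.get), results)
```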

For the LR it is common to use warmup-and-decay scheduling instead; the early peak LR can be set as high as possible, and the decay parameters are tuned instead (alternatively, the ReduceLROnPlateau scheduler decreases the LR automatically when the monitored metric stops improving).
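
For illustration, a warmup-plus-inverse-square-root decay (the same shape as the schedule in “Attention Is All You Need”) can be written with `torch.optim.lr_scheduler.LambdaLR`; the peak LR and `warmup_steps` below are only example settings, and they are the knobs you would tune instead of a fixed LR:

```python
import torch

# Placeholder model; only the optimizer/scheduler wiring matters here.
model = torch.nn.Linear(256, 256)
peak_lr, warmup_steps = 1e-3, 4000
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.98), eps=1e-9)

def inverse_sqrt(step):
    # Linear warmup to peak_lr over warmup_steps, then decay ~ 1/sqrt(step).
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

for step in range(1000):
    # ... forward / backward / gradient clipping would go here ...
    optimizer.step()
    scheduler.step()  # advance the LR schedule once per optimizer step
```

Alternatively, `torch.optim.lr_scheduler.ReduceLROnPlateau`, stepped with the validation loss, gives the automatic decrease mentioned above.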


@RaLo4 & @googlebot Thank you so much, both comments are very valuable. I’m working on them now :slight_smile: