Hello all
I have a question for the experts about setting hyperparameters such as the learning rate and dropout for an NMT project (a multi-head attention Transformer). The training data is large and each training session takes a long time, so it is hard to tune them by trial and error. The total training data is as follows:
- Synthetic data for pre-training: 6 GB.
- First-stage fine-tuning data: 20 MB.
- Second-stage fine-tuning (monolingual) data: 200 MB.
I have decided to use a different learning rate for each dataset, e.g. 1e-4, 1e-3, and 1e-1, but I'm not sure these are the best options.
Is there any statistical method to estimate good values for the learning rate and dropout?
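To be concrete about the kind of method I mean: something like a learning-rate range test, where you sweep candidate rates and keep the one that reduces the loss fastest without diverging. Here is a toy sketch in pure Python on a 1-D quadratic loss (the real NMT loss would replace `loss_after`; all names here are placeholders of my own):

```python
def loss_after(lr, steps=20, w0=0.0, target=3.0):
    """Run plain SGD on the toy loss (w - target)^2 and return the final loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of (w - target)^2
        w -= lr * grad
    return (w - target) ** 2

# Sweep candidate learning rates and pick the one with the lowest final loss.
candidates = [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.7, 1.2]
results = {lr: loss_after(lr) for lr in candidates}
best_lr = min(results, key=results.get)
print(best_lr)  # 0.5 for this toy loss; too-large rates (1.2) diverge
```

On the real model one would sweep over a few hundred mini-batches rather than a closed-form loss, but the selection logic is the same.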
The model hyperparameters are as follows:
D_MODEL = 256
N_LAYERS = 4
N_HEADS = 8
HIDDEN_SIZE = 512
MAX_LEN = 400
DROPOUT = 0.15
BATCH_SIZE = 64
LR = 1e-4
N_EPOCHS = 5
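For scale, here is my own back-of-the-envelope parameter count for the encoder layers of this configuration (rough arithmetic only; it ignores the embeddings, the decoder's extra cross-attention, and the output projection):

```python
# Rough per-layer parameter count for a Transformer encoder with the settings above.
D_MODEL = 256
N_LAYERS = 4
HIDDEN_SIZE = 512

attn = 4 * D_MODEL * D_MODEL + 4 * D_MODEL               # Q/K/V/output projections + biases
ffn = 2 * D_MODEL * HIDDEN_SIZE + HIDDEN_SIZE + D_MODEL  # two feed-forward linears + biases
norms = 2 * 2 * D_MODEL                                  # two LayerNorms (scale + shift each)
per_layer = attn + ffn + norms
total = N_LAYERS * per_layer
print(per_layer, total)  # 527104 2108416
```

So the encoder body is only about 2M parameters, which is small relative to 6 GB of pre-training data.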
The available server has two TITAN RTX GPUs, with roughly 50 GB of GPU RAM in total.
Kind regards,
Aiman Solyman