What is the word_language_model example's Optimizer?

In the LSTM language modelling example, https://github.com/pytorch/examples/blob/master/word_language_model/main.py,
we have a learning rate of 20, which seems very high. Which optimizer is used, and is there an explanation for choosing an initial learning rate of 20 (with no decay, as far as I can see)? Thank you!

The parameter update step is defined here (a vanilla gradient descent step):
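
As a rough sketch of what a vanilla gradient descent step looks like (plain Python rather than the example's actual code; `sgd_step`, `params`, and `grads` are hypothetical names and the numbers are made up for illustration):

```python
def sgd_step(params, grads, lr):
    """One plain gradient descent step: p <- p - lr * g for every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]


# Illustrative values only: two scalar "parameters" with small gradients.
params = [1.0, -2.0]
grads = [0.01, -0.02]

# With lr = 20, small gradients still move the parameters noticeably.
updated = sgd_step(params, grads, lr=20.0)
print(updated)
```

There is no momentum or adaptive scaling here: the step size is just the learning rate times the gradient, so how large the gradients are decides which learning rate is sensible.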

The learning rate is made to decay in the train loop, defined here:
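
A minimal sketch of that kind of decay, assuming the rate is divided by a fixed factor whenever the validation loss fails to improve (the `anneal` helper, the factor of 4, and the loss values are illustrative assumptions, not copied from the example):

```python
def anneal(lr, val_losses, factor=4.0):
    """Track the learning rate across epochs, shrinking it when validation stalls."""
    best = None
    history = []
    for loss in val_losses:
        if best is None or loss < best:
            best = loss      # validation improved: keep the current rate
        else:
            lr /= factor     # no improvement: shrink the rate
        history.append(lr)
    return history


# Made-up validation losses for five epochs, starting from lr = 20.
history = anneal(20.0, [6.0, 5.5, 5.6, 5.4, 5.45])
print(history)
```

So the rate of 20 is only a starting point; it drops quickly once validation loss stops improving.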

Thank you for the explanation, @vabh.
I am still confused about why an initial learning rate of 20 works so well in this example. The best initial learning rate varies across tasks and datasets, but with an SGD optimizer it is usually less than or equal to 1.0. I tried 1.0 as the initial learning rate here, and the results got worse. Does anyone have a good explanation for this? Thanks in advance!

I think it comes down to the scale of the gradients.

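In that example, the loss is divided by the number of words in the mini-batch (batch size times sequence length), so the gradients are correspondingly smaller and a larger LR is needed to compensate.

In previous LM work, the loss is normally divided only by the mini-batch size (summed over the time steps), so the gradients are larger and LR=1.0 works.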
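
To make the scaling argument concrete, here is a small sketch comparing the two normalizations. The batch size, sequence length, and per-token gradient values are illustrative assumptions, not values read from the example:

```python
# Illustrative sizes (assumed): 20 sequences of 35 tokens each.
batch_size, seq_len = 20, 35

# Hypothetical per-token gradient contributions to a single parameter.
token_grads = [[0.5 for _ in range(seq_len)] for _ in range(batch_size)]
total = sum(sum(row) for row in token_grads)

# Loss averaged over every word in the mini-batch.
grad_per_token = total / (batch_size * seq_len)

# Loss summed over time steps and divided only by the batch size.
grad_per_sequence = total / batch_size

# The two gradients differ by exactly the sequence length.
ratio = grad_per_sequence / grad_per_token
print(grad_per_token, grad_per_sequence, ratio)
```

Under these assumptions the per-word average gives gradients `seq_len` times smaller, which is the same order of magnitude as the gap between LR=20 and LR=1.0.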