In the LSTM language modelling example, https://github.com/pytorch/examples/blob/master/word_language_model/main.py,
we have a learning rate of 20, which seems very high. What optimizer was used, and is there any explanation for choosing an initial learning rate of 20 (with no decay, as far as I can see)? Thank you!
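In case it helps frame the question, here is a minimal sketch of what I believe the update step in main.py does: gradient clipping followed by a hand-written, momentum-free SGD step (no torch.optim optimizer). The tiny LSTM, decoder, and dummy data below are placeholders I made up; lr=20 and clip=0.25 match the script's defaults.

```python
import torch
import torch.nn as nn

# Stand-ins for the example's RNNModel and corpus (not the script's own code).
model = nn.LSTM(input_size=32, hidden_size=64)
decoder = nn.Linear(64, 10)
criterion = nn.CrossEntropyLoss()

lr = 20.0    # the example's default --lr
clip = 0.25  # the example's default --clip

data = torch.randn(35, 8, 32)              # (seq_len, batch, features), dummy input
targets = torch.randint(0, 10, (35 * 8,))  # dummy targets

model.zero_grad()
decoder.zero_grad()
output, _ = model(data)
loss = criterion(decoder(output).view(-1, 10), targets)
loss.backward()

# Clip the global gradient norm, then take a plain SGD step by hand.
params = list(model.parameters()) + list(decoder.parameters())
torch.nn.utils.clip_grad_norm_(params, clip)
for p in params:
    p.data.add_(p.grad, alpha=-lr)
```

If my reading is right, the clipping to a small norm means the effective step size is much smaller than the raw lr=20 suggests, which is part of what I'm hoping someone can confirm.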
Thank you for the explanation @vabh
I am still confused about why an initial learning rate of 20 works so well in this example. Though the initial learning rate varies across tasks and datasets, it is usually less than (or equal to) 1.0 with the SGD optimizer. I tried 1.0 as the initial learning rate here, but the results became worse. Does anyone have a good explanation for this? Thanks in advance!