What is the word_language_model example's Optimizer?

In the LSTM language modelling example, https://github.com/pytorch/examples/blob/master/word_language_model/main.py,
we have a learning rate of 20, which seems very high. Which optimizer is used, and is there any explanation for choosing an initial learning rate of 20 (with no decay, as far as I can see)? Thank you!


The parameter update step is defined in main.py itself: it is a vanilla gradient descent step applied manually to each parameter, with no torch.optim optimizer object.
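
A minimal sketch of that kind of manual update (the model, data, and loss below are toy stand-ins for illustration, not the example's actual LSTM):

```python
import torch
import torch.nn as nn

# Toy stand-in for the LSTM language model (illustrative only).
model = nn.Linear(10, 10)
criterion = nn.MSELoss()
lr = 20  # the example's default initial learning rate

x, y = torch.randn(4, 10), torch.randn(4, 10)
loss = criterion(model(x), y)

model.zero_grad()
loss.backward()

# Vanilla gradient descent: update each parameter in place, no optimizer object.
with torch.no_grad():
    for p in model.parameters():
        p.add_(p.grad, alpha=-lr)
```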

The learning rate does decay: it is annealed in the outer training loop whenever the validation loss stops improving.
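
A rough sketch of that annealing logic (train() and evaluate() are dummy stand-ins, and the exact division factor is an assumption; the point is that the lr shrinks when validation stalls):

```python
import random

lr = 20.0
best_val_loss = None

def train(lr):
    pass  # stand-in: one pass over the training data at the current lr

def evaluate():
    return random.uniform(4.0, 6.0)  # stand-in: pretend validation loss

for epoch in range(1, 6):
    train(lr)
    val_loss = evaluate()
    if best_val_loss is None or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        lr /= 4.0  # anneal the learning rate when there is no improvement
    print(f"epoch {epoch}: val_loss={val_loss:.3f}  lr={lr:g}")
```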

Thank you for the explanation, @vabh.
I am still confused about why an initial learning rate of 20 works so well in this example. Although the initial learning rate varies across tasks and datasets, it is usually less than or equal to 1.0 with an SGD optimizer. I tried 1.0 as the initial learning rate here, but the results got worse. Does anyone have a good explanation for this? Thanks in advance!

I think it comes down to the scale of the gradients.

In that example, the loss is averaged over all the words in the mini-batch (batch size × sequence length), so the gradients are smaller in scale and a larger learning rate is needed to compensate.

In previous language-modelling work, the loss is typically divided only by the mini-batch size, so LR = 1.0 works.
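
A quick way to check this scaling argument numerically (the batch size and sequence length below roughly match the example's defaults and are assumptions here): averaging the cross-entropy over every token makes the gradients about seq_len times smaller than averaging over the batch dimension only, so a proportionally larger learning rate gives a comparable effective step size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab = 20, 35, 100   # roughly the example's defaults
logits = torch.randn(batch_size * seq_len, vocab, requires_grad=True)
targets = torch.randint(vocab, (batch_size * seq_len,))

# Loss averaged over every token in the mini-batch (as in the example).
loss_per_token = F.cross_entropy(logits, targets, reduction='mean')
grad_per_token = torch.autograd.grad(loss_per_token, logits)[0]

# Loss summed over time steps and averaged only over the batch dimension.
loss_per_batch = F.cross_entropy(logits, targets, reduction='sum') / batch_size
grad_per_batch = torch.autograd.grad(loss_per_batch, logits)[0]

# The per-token gradients are seq_len times smaller, so lr = 20 under the first
# convention roughly corresponds to lr = 20 / seq_len under the second.
print(grad_per_batch.norm() / grad_per_token.norm())  # ~ seq_len (35)
```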