Hello,
I have a question about the vanilla SGD optimizer. I noticed that if I scale the loss by a factor and divide the learning rate by that same factor, for example:
optimizer = torch.optim.SGD(lenet_model.parameters(), lr=learning_rate / alpha)
criterion_train = nn.CrossEntropyLoss()
.......
loss = alpha * criterion_train(train_labels_head, train_labels)
the result of the training differs between the two runs (of course I am comparing otherwise identical training runs, with all the seeds set to the same values). A minimal, self-contained version of what I am doing is below.
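This is just a sketch of the comparison, with a toy linear model standing in for my LeNet and made-up values for alpha, the learning rate, and the data:

import torch
import torch.nn as nn

def train_once(alpha, learning_rate=0.1, steps=50):
    torch.manual_seed(0)                       # same seed for every run
    model = nn.Linear(10, 3)                   # toy stand-in for my LeNet
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate / alpha)
    criterion = nn.CrossEntropyLoss()
    x = torch.randn(64, 10)
    y = torch.randint(0, 3, (64,))
    for _ in range(steps):
        optimizer.zero_grad()
        loss = alpha * criterion(model(x), y)  # scale the loss by alpha
        loss.backward()
        optimizer.step()
    return model.weight.detach().clone()

w_base = train_once(alpha=1.0)
w_scaled = train_once(alpha=100.0)
# Mathematically the two weight tensors should be identical, but the
# maximum absolute difference I see is small yet non-zero.
print((w_base - w_scaled).abs().max())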
- In theory, scaling the loss by alpha and dividing the learning rate by alpha should not make any difference, since the two factors cancel each other in the update (the update rule I have in mind is written out below this list). Is that correct?
- Is the difference I observe due to limited floating-point precision?
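For reference, this is the cancellation I have in mind for plain SGD (no momentum, no weight decay), with learning rate $\eta$ and scaling factor $\alpha$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\alpha}\,\nabla_\theta\big(\alpha\,L(\theta_t)\big) = \theta_t - \frac{\eta}{\alpha}\,\alpha\,\nabla_\theta L(\theta_t) = \theta_t - \eta\,\nabla_\theta L(\theta_t)$$

So in exact arithmetic the two runs should produce the same parameters at every step; I only see them diverge in practice.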