Hello,
I have a question about the vanilla SGD optimizer. I noticed that if I scale the loss by a factor and divide the learning rate by that same factor, for example:
optimizer = torch.optim.SGD(lenet_model.parameters(), lr=learning_rate / alpha)
criterion_train = nn.CrossEntropyLoss()
.......
loss = alpha * criterion_train(train_labels_head, train_labels)
the result of the training differs between the two runs (of course I am comparing otherwise identical training runs, with all the seeds set to the same values). A minimal, self-contained version of what I am doing is below.
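This is just a sketch of the comparison, with a toy linear model standing in for my LeNet and made-up values for alpha, the learning rate, and the data:

import torch
import torch.nn as nn

def train_once(alpha, learning_rate=0.1, steps=50):
    torch.manual_seed(0)                       # same seed for every run
    model = nn.Linear(10, 3)                   # toy stand-in for my LeNet
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate / alpha)
    criterion = nn.CrossEntropyLoss()
    x = torch.randn(64, 10)
    y = torch.randint(0, 3, (64,))
    for _ in range(steps):
        optimizer.zero_grad()
        loss = alpha * criterion(model(x), y)  # scale the loss by alpha
        loss.backward()
        optimizer.step()
    return model.weight.detach().clone()

w_base = train_once(alpha=1.0)
w_scaled = train_once(alpha=100.0)
# Mathematically the two weight tensors should be identical, but the
# maximum absolute difference I see is small yet non-zero.
print((w_base - w_scaled).abs().max())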
- In theory, scaling the loss by alpha and dividing the learning rate by alpha should not make any difference, since the two factors cancel each other in the update (the update rule I have in mind is written out below this list). Is that correct?
- Is the difference I observe due to limited floating-point precision?
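For reference, this is the cancellation I have in mind for plain SGD (no momentum, no weight decay), with learning rate $\eta$ and scaling factor $\alpha$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\alpha}\,\nabla_\theta\big(\alpha\,L(\theta_t)\big) = \theta_t - \frac{\eta}{\alpha}\,\alpha\,\nabla_\theta L(\theta_t) = \theta_t - \eta\,\nabla_\theta L(\theta_t)$$

So in exact arithmetic the two runs should produce the same parameters at every step; I only see them diverge in practice.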