Is dividing the loss by n equivalent to dividing the learning rate by n?

Hello all, if my code is:

import torch.nn as nn
import torch.optim as optim

n = 10
learning_rate = 1e-2
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-6)
loss = criterion(outputs, targets)
loss = loss / n  # scale the loss by 1/n before the backward pass
loss.backward()

Is the above equivalent to the code below, which uses learning rate / n instead?

n = 10
learning_rate = 1e-3  # 1e-2 / n
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-6)
loss = criterion(outputs, targets)
loss.backward()

I am using SGD. If they are not equivalent, what should I change in the first snippet so that I can drop loss = loss / n? Thanks

If you’re using plain SGD, yes.
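
For plain SGD (no momentum, no weight decay) the update is w = w - lr * dL/dw, so scaling the loss by 1/n scales the gradient by 1/n, which is exactly the same as scaling lr by 1/n. A minimal check of that on a toy scalar parameter (names and values made up for illustration):

import torch

n, lr = 10, 1e-2

# Variant A: divide the loss by n, keep the original lr
w_a = torch.tensor([1.0], requires_grad=True)
opt_a = torch.optim.SGD([w_a], lr=lr)
((w_a ** 2).sum() / n).backward()
opt_a.step()

# Variant B: keep the loss, divide the lr by n
w_b = torch.tensor([1.0], requires_grad=True)
opt_b = torch.optim.SGD([w_b], lr=lr / n)
(w_b ** 2).sum().backward()
opt_b.step()

print(w_a.item(), w_b.item())  # both 0.998: identical update with plain SGD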


Thanks for your reply. I have added more detail to the question. Actually, I used SGD + momentum. Let me know if you want to change your answer for the updated question.

If you’re using momentum and/or weight decay (or another optimizer), it won’t be equivalent.
For example, SGD with weight decay (with learning rate lr and weight decay wd) will do the following update: w = w - lr * (dL/dw + wd * w). You can see that scaling your loss, dL/dw -> 1/n * dL/dw, will not have the same effect as changing the learning rate, lr -> 1/n * lr: the weight decay term will not be scaled the same way.

For other optimizers, you will need to check the update formula and see whether scaling the gradient and scaling the learning rate have the same effect on the update (it is really unlikely).
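
To make the weight decay case concrete, here is a minimal numeric check on a toy scalar parameter (values made up, with a deliberately large weight decay so the mismatch is easy to see):

import torch

n, lr, wd = 10, 1e-2, 1e-1

# Variant A: loss / n with the original lr
w_a = torch.tensor([1.0], requires_grad=True)
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=wd)
((w_a ** 2).sum() / n).backward()
opt_a.step()  # w = w - lr * (dL/dw / n + wd * w)

# Variant B: original loss with lr / n
w_b = torch.tensor([1.0], requires_grad=True)
opt_b = torch.optim.SGD([w_b], lr=lr / n, weight_decay=wd)
(w_b ** 2).sum().backward()
opt_b.step()  # w = w - (lr / n) * (dL/dw + wd * w)

print(w_a.item(), w_b.item())  # ~0.9970 vs ~0.9979: the updates differ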


@albanD Please correct me if I am wrong. It seems like this is because PyTorch’s SGD is formulated differently from other frameworks (e.g. Caffe).

I am porting a network from Caffe and am trying to understand why, if I increase the lr (after a certain epoch), the network always becomes unstable (inf weights and nan loss).
~It seems like PyTorch’s SGD is more sensitive to lr changes because the lr is applied to the velocity instead of the gradients. Is there any particular reason for this choice?~
EDIT: I added an SGD that is more like other frameworks, and if I increase the lr the network still becomes unstable. Decreasing the lr is always fine.
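
For reference, this is the rough scalar comparison of the two momentum conventions I had in mind (my own simplification, ignoring dampening and weight decay; not the actual optimizer code):

momentum, lr, grad = 0.9, 0.1, 1.0
v_pt = v_other = 0.0
w_pt = w_other = 0.0

for _ in range(3):
    # PyTorch-style: lr multiplies the accumulated velocity at update time
    v_pt = momentum * v_pt + grad
    w_pt = w_pt - lr * v_pt
    # Caffe/Sutskever-style: lr is folded into the velocity as it accumulates
    v_other = momentum * v_other - lr * grad
    w_other = w_other + v_other

print(w_pt, w_other)  # the same while lr stays constant

lr = 0.2  # increase lr mid-training
v_pt = momentum * v_pt + grad
w_pt = w_pt - lr * v_pt                   # the new lr rescales the whole velocity history
v_other = momentum * v_other - lr * grad
w_other = w_other + v_other               # the new lr only scales the fresh gradient term
print(w_pt, w_other)  # they diverge; the PyTorch-style step is larger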