Weighting loss vs changing learning rate (using SGD)

I want to be sure that weighting the loss in PyTorch produces the same resulting weights as changing the learning rate would.
The gradients attached to the weights will be different, but that’s fine.
The math works out on paper, so I’m curious about the implementation.

In short, is Loss * 1 with lr=2 the same as Loss * 2 with lr=1? EDIT: and would that apply to any equivalent pair?

Here are two short bits of code that demonstrate the behavior that I want:

  1. With lr=2
import torch as t

inp = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)
w1 = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)
w2 = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)

a = (inp*w1)
b = (a*w2)
L = b.sum()

L.backward()
opt = t.optim.SGD([w1,w2], lr=2)
opt.step()
print(w1, w2, sep='\n')            # updated weights
print(w1.grad, w2.grad, sep='\n')  # unscaled gradients

  2. With L*2

import torch as t

inp = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)
w1 = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)
w2 = t.tensor([[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]], requires_grad=True)

a = (inp*w1)
b = (a*w2)
L = b.sum()

L = 2*L  # scale the loss by 2 instead of doubling the learning rate

L.backward()
opt = t.optim.SGD([w1,w2], lr=1)
opt.step()
print(w1, w2, sep='\n')            # updated weights (should match the first run)
print(w1.grad, w2.grad, sep='\n')  # gradients are doubled, but the updates match
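To check the equivalence programmatically rather than by eye, the two runs above can be folded into one helper and the updated weights compared directly (a minimal sketch; the `run` function is my own wrapper, not part of the original code):

```python
import torch as t

def run(loss_scale, lr):
    # same fixed initial values for every run so results are comparable
    init = [[.1,.2,.3],[.2,.3,.4],[.4,.5,.6]]
    inp = t.tensor(init)
    w1 = t.tensor(init, requires_grad=True)
    w2 = t.tensor(init, requires_grad=True)
    L = loss_scale * (inp * w1 * w2).sum()
    L.backward()
    t.optim.SGD([w1, w2], lr=lr).step()
    return w1.detach(), w2.detach()

# Loss * 1 with lr=2 vs Loss * 2 with lr=1
a1, a2 = run(1, 2)
b1, b2 = run(2, 1)
print(t.allclose(a1, b1), t.allclose(a2, b2))  # True True
```

For plain SGD without momentum the update is `w -= lr * grad`, and scaling the loss scales the gradient by the same constant, so only the product `lr * loss_scale` matters.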

I’m not sure if I understand this part correctly, but while your approach would work for a plain SGD optimizer, it will not work for more advanced ones, such as Adam.

Thank you, I think that was what I was looking for.

Still, for clarity’s sake, I meant any pair of loss-weight/learning-rate combinations whose product is the same.
I.e., w=2 and lr=1 should be the same as w=1 and lr=2; w=3/lr=4 the same as w=6/lr=2, and so on.

Also, thanks for the reminder that this only works with optimizers that aren’t adapting the learning rate or doing other fancy things.
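To illustrate why adaptive optimizers break the equivalence (a sketch I added, not from the original thread): Adam divides the gradient by a running estimate of its magnitude, so scaling the loss by a constant roughly cancels out, while changing lr scales the step directly. The toy objective below is hypothetical.

```python
import torch as t

def adam_step(loss_scale, lr):
    # one Adam step on a toy quadratic loss from a fixed starting point
    w = t.tensor([1.0, 2.0, 3.0], requires_grad=True)
    L = loss_scale * (w ** 2).sum()
    L.backward()
    t.optim.Adam([w], lr=lr).step()
    return w.detach()

print(adam_step(1, 0.1))  # baseline
print(adam_step(2, 0.1))  # scaled loss: nearly identical result
print(adam_step(1, 0.2))  # doubled lr: the step itself doubles
```

On the first step Adam’s update is approximately `lr * sign(grad)`, so the gradient’s scale drops out entirely; for SGD the same loss scaling would have doubled the step.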