Is AdamW invariant when the loss function is multiplied by a positive number?

songyuc · March 10, 2022, 9:19am

Hi, guys,
According to the documentation of AdamW [doc], it seems that this implementation of AdamW should be invariant when the loss function is multiplied by a positive number.
But in our test it does not behave as documented:

...
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()
optimizer1 = optim.AdamW(net1.parameters(), lr=0.0001, betas=(0.1,0.1), weight_decay=0.9,eps=1e-08)
optimizer2 = optim.AdamW(net2.parameters(), lr=0.0001, betas=(0.1,0.1), weight_decay=0.9,eps=1e-08*1000)
# eps is multiplied by the same scale
...

input2 = inputs.clone()
label2 = labels.clone()
# zero the parameter gradients
optimizer2.zero_grad()
# forward + backward + optimize
output2 = net2(input2)
loss2 = criterion2(output2, label2)*s
# loss2 is multiplied by the same scale
loss2.backward()
optimizer2.step()
...

[Torch_ls.ipynb]
So, how can we explain this behavior?

Your answers and guidance would be appreciated!

KFrank · March 10, 2022, 6:48pm

Hi Song!

AdamW does appear to be invariant to the “scale” of the loss function,
provided that eps is scaled by the same factor.

Is it possible that you are not initializing net1 and net2 identically?
Typically your network weights will be initialized randomly, and if you
don’t reset the pseudorandom-number generator (or take other measures),
net1 and net2 won’t start out the same.
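
As one example of such "other measures", the second network can be given the first network's parameters directly instead of relying on the seed. This is only a minimal sketch: the `Linear (5, 3)` shape is borrowed from the script in this post, and `net3`, `same_a`, and `same_b` are illustrative names.

```python
import copy
import torch

net1 = torch.nn.Linear(5, 3)

# Option A: deep-copy the whole module, parameters included.
net2 = copy.deepcopy(net1)

# Option B: build a fresh (differently initialized) module,
# then overwrite its parameters with those of net1.
net3 = torch.nn.Linear(5, 3)
net3.load_state_dict(net1.state_dict())

# Both copies should now hold exactly the same values as net1.
same_a = all(torch.equal(p, q) for p, q in zip(net1.parameters(), net2.parameters()))
same_b = all(torch.equal(p, q) for p, q in zip(net1.parameters(), net3.parameters()))
print('deepcopy identical:', same_a)
print('load_state_dict identical:', same_b)
```

Either option gives the copy its own parameter tensors, so each network can be handed to its own optimizer.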

Here is a self-contained, runnable script that illustrates AdamW’s
invariance:

import torch
print (torch.__version__)

_ = torch.manual_seed (2022)

inputs = torch.randn (10, 5)
labels = torch.randint (3, (10,))

_ = torch.manual_seed (12345)   # set seed used to initialize net1
net1 = torch.nn.Linear (5, 3)
_ = torch.manual_seed (12345)   # reset seed so that net2 is initialized identically to net1
net2 = torch.nn.Linear (5, 3)

print ('check weight:', torch.equal (net1.weight, net2.weight))
print ('check bias:', torch.equal (net1.bias, net2.bias))

scale = 1000.0

optimizer1 = torch.optim.AdamW (net1.parameters(), lr = 0.0001, betas = (0.1,0.1), weight_decay = 0.9, eps = 1e-08)
optimizer2 = torch.optim.AdamW (net2.parameters(), lr = 0.0001, betas = (0.1,0.1), weight_decay = 0.9, eps = scale * 1e-08)

for i in range (5):
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss1 = torch.nn.CrossEntropyLoss() (net1 (inputs), labels)
    loss2 = scale * torch.nn.CrossEntropyLoss() (net2 (inputs), labels)
    loss1.backward()
    loss2.backward()
    optimizer1.step()
    optimizer2.step()

print ('check weight:', torch.equal (net1.weight, net2.weight))
print ('check bias:', torch.equal (net1.bias, net2.bias))

And here is its output:

1.10.2
check weight: True
check bias: True
check weight: True
check bias: True
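
This result is consistent with the form of the update. Scaling the loss by s scales the gradient by s, so the first-moment estimate scales by s and the square root of the second-moment estimate also scales by s. If eps is scaled by s as well, the factor cancels in the ratio, and the decoupled weight-decay term never sees the gradient at all. The following is a rough scalar sketch of that argument, not the library's implementation: `adamw_steps` is a hypothetical helper written from the update rule as I understand it from the AdamW documentation, and it uses a fixed, made-up gradient sequence instead of a real loss.

```python
import math

def adamw_steps(grads, lr=1e-4, beta1=0.1, beta2=0.1, eps=1e-8,
                weight_decay=0.9, theta=1.0):
    """Apply scalar AdamW updates for a list of gradients; return the final theta."""
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    for t, g in enumerate(grads, start=1):
        theta *= 1.0 - lr * weight_decay       # decoupled weight decay, independent of g
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)         # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

grads = [0.3, -0.7, 0.05]   # made-up gradients, for illustration only
s = 1000.0

base = adamw_steps(grads, eps=1e-8)
scaled_both = adamw_steps([s * g for g in grads], eps=s * 1e-8)
scaled_grad_only = adamw_steps([s * g for g in grads], eps=1e-8)

print('difference, gradient and eps scaled:', abs(base - scaled_both))
print('difference, only gradient scaled:   ', abs(base - scaled_grad_only))
```

With both the gradient and eps scaled, the two runs should agree up to floating-point rounding. With only the gradient scaled, a small difference on the order of eps relative to the gradient magnitude remains.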

Best.

K. Frank

songyuc · March 11, 2022, 10:01am

Thanks, Frank! I can reproduce it with your code, but even after changing my code to follow your settings, we still cannot get equivalent results like yours.
Could you please help me find out which operation may cause this difference?
[Torch_ls.ipynb]
Appreciate it very much.
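
One way to narrow down where two runs start to diverge is to measure the largest parameter difference before the first step and again after each step. The sketch below is only a debugging aid under assumptions: `max_param_diff` is a hypothetical helper, and the two `Linear (5, 3)` networks stand in for the notebook's `net1` and `net2`, whose actual architecture is not shown in this thread.

```python
import torch

def max_param_diff(model_a, model_b):
    """Return {parameter name: largest absolute difference} for two same-shaped models."""
    params_b = dict(model_b.named_parameters())
    return {
        name: (p.detach() - params_b[name].detach()).abs().max().item()
        for name, p in model_a.named_parameters()
    }

# Stand-in networks, initialized identically by resetting the seed.
torch.manual_seed(12345)
net1 = torch.nn.Linear(5, 3)
torch.manual_seed(12345)
net2 = torch.nn.Linear(5, 3)

diffs = max_param_diff(net1, net2)
print(diffs)
```

If the difference is already nonzero before training, the initialization is the likely culprit. If it only appears after some steps, it may be worth checking for anything that makes the two forward passes differ, such as dropout or differently ordered batches. Those are guesses, since the notebook itself is not reproduced here. Differences at the level of float32 rounding can also be benign, because multiplying by a scale that is not a power of two is not exact in floating point.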