According to the documentation of AdamW[doc], it seems that this implementation of AdamW should be invariant to multiplying the loss function by a positive constant.
But in our test it does not behave as documented:
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()
optimizer1 = optim.AdamW(net1.parameters(), lr=0.0001, betas=(0.1,0.1), weight_decay=0.9,eps=1e-08)
optimizer2 = optim.AdamW(net2.parameters(), lr=0.0001, betas=(0.1,0.1), weight_decay=0.9,eps=1e-08*1000)
# eps is multiplied by the same scale as the loss
input2 = inputs.clone()
label2 = labels.clone()
# zero the parameter gradients
# forward + backward + optimize
output2 = net2(input2)
loss2 = criterion2(output2, label2)*s
# loss2 is multiplied by the same scale as eps
AdamW does appear to be invariant to the “scale” of the loss function.
Is it possible that you are not initializing net1 and net2 identically?
Typically your network weights will be initialized randomly, and if you
don’t reset the pseudorandom-number generator (or take other measures), net1 and net2 won’t start out the same.
Here is a self-contained, runnable script that illustrates AdamW's
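A minimal sketch of such a check follows. It is not the original script from this reply. The toy model, the random data, the value s = 1000, and the use of float64 are all assumptions made for illustration. The two networks are made identical with copy.deepcopy, the second loss is multiplied by s, and the second optimizer's eps is multiplied by the same s, as in the question.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

s = 1000.0   # assumed loss scale, matching the eps factor used in the question
eps = 1e-8

# Hypothetical toy model; float64 keeps round-off differences small.
net1 = nn.Linear(4, 3).double()
net2 = copy.deepcopy(net1)  # exact copy, so both nets start from the same weights

# Hypothetical random batch, shared by both networks.
inputs = torch.randn(8, 4, dtype=torch.float64)
labels = torch.randint(0, 3, (8,))

criterion = nn.CrossEntropyLoss()
opt1 = optim.AdamW(net1.parameters(), lr=1e-4, betas=(0.1, 0.1),
                   weight_decay=0.9, eps=eps)
opt2 = optim.AdamW(net2.parameters(), lr=1e-4, betas=(0.1, 0.1),
                   weight_decay=0.9, eps=eps * s)

for _ in range(10):
    # Unscaled loss with the unscaled eps.
    opt1.zero_grad()
    criterion(net1(inputs), labels).backward()
    opt1.step()

    # Loss scaled by s, with eps scaled by the same s.
    opt2.zero_grad()
    (criterion(net2(inputs), labels) * s).backward()
    opt2.step()

# Largest absolute difference between corresponding parameters.
max_diff = max((p1 - p2).abs().max().item()
               for p1, p2 in zip(net1.parameters(), net2.parameters()))
print(max_diff)
```

If the two networks start from the same weights and see the same batches, max_diff should stay at the level of floating-point round-off. A larger difference would point to something else differing between the two runs, such as the initial weights or the data.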
Thanks, Frank! Your code works for me, but even after changing my code to follow your setup, I still cannot get equivalent results like yours.
Could you please help me find out which operation may cause this difference? [Torch_ls.ipynb]
Appreciate it very much.