loss.backward()
for group in optimizer.param_groups:  # param_groups is an attribute, not a method
    for param in group['params']:
        # decoupled weight decay: p <- p * (1 - wd * lr), applied in place
        param.data.add_(param.data, alpha=-wd * group['lr'])
optimizer.step()
Am I missing something in deriving the PyTorch implementation from Algorithm 2?
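For anyone reading along: the older two-argument `tensor.add(value, other)` form computes `self + value * other`, so (assuming `wd` holds the decoupled weight-decay coefficient) the loop above just multiplies each parameter by `(1 - lr * wd)`. A plain-float sketch of that arithmetic:

```python
# Sketch (assumption: wd is the decoupled weight-decay coefficient).
# The two-argument add(value, other) computes self + value * other, so
# param.add(-wd * lr, param) == param * (1 - wd * lr).
param = 2.0          # a parameter value, as a plain float for illustration
lr, wd = 0.1, 0.01   # learning rate and weight-decay coefficient

old_style = param + (-wd * lr) * param   # what add(-wd * lr, param) returns
in_place = param * (1 - wd * lr)         # the equivalent closed form

print(old_style, in_place)  # both 1.998
```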
Thank you! Your GitHub comments really help me understand the "hidden" thoughts, though I scratched my head for hours before I realized them. To verify the implementation, line 12 could be further expanded as
group["lr"] = eta_t * alpha, because group["lr"] might come from an lr scheduler, and group["weight_decay"] = lambda / alpha; when the two are multiplied, the alphas cancel.
(Sorry for the ugly plain-text equations).
The actual weight decay parameter is scaled by the inverse of the initial learning rate (1/alpha), and we should take that into consideration when tuning models.