Need an explanation of the AdamW implementation

The AdamW paper describes Adam with decoupled weight decay in Algorithm 2; line 12 there updates the parameters as (in plain text)

theta_t <- theta_{t-1} - eta_t * (alpha * m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_{t-1})

The corresponding PyTorch implementation is

# Perform stepweight decay
p.data.mul_(1 - group['lr'] * group['weight_decay'])
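For context, in my reading this line sits at the top of the per-parameter loop of AdamW.step(). Below is a simplified sketch of the surrounding logic; it is my own reconstruction with assumed state bookkeeping (group/state dicts mimicking torch.optim conventions), not the actual torch.optim source.

import math
import torch

@torch.no_grad()
def adamw_step_sketch(group, state):
    """Simplified per-group AdamW step; `group` and `state` mimic torch.optim bookkeeping."""
    for p in group['params']:
        if p.grad is None:
            continue

        # Perform stepweight decay (the quoted line): the decay is decoupled from the gradient
        p.data.mul_(1 - group['lr'] * group['weight_decay'])

        st = state[p]
        st['step'] += 1
        beta1, beta2 = group['betas']

        # Standard Adam moment estimates
        st['exp_avg'].mul_(beta1).add_(p.grad, alpha=1 - beta1)
        st['exp_avg_sq'].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)

        bias_correction1 = 1 - beta1 ** st['step']
        bias_correction2 = 1 - beta2 ** st['step']
        denom = (st['exp_avg_sq'].sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])

        # Adam part of line 12: theta -= (lr / bias_correction1) * m_t / (sqrt(v_hat_t) + eps)
        p.data.addcdiv_(st['exp_avg'], denom, value=-group['lr'] / bias_correction1)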

I’m stuck on how line 12 in Algorithm 2 (AdamW) turns into the PyTorch version.

I googled for a while and found that fast.ai published a post, AdamW and Super-convergence is now the fastest way to train neural nets, which concluded that AdamW might be implemented roughly like

loss.backward()
# Apply decoupled weight decay by hand before the Adam update
for group in optimizer.param_groups:
    for param in group['params']:
        param.data = param.data.add(param.data, alpha=-wd * group['lr'])
optimizer.step()
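For what it’s worth, the fast.ai additive form and the PyTorch multiplicative form compute the same update, theta * (1 - lr * wd). A quick check (the lr/wd values are made up):

import torch

torch.manual_seed(0)
lr, wd = 1e-3, 1e-2

p1 = torch.randn(5)
p2 = p1.clone()

# fast.ai-style additive form: p <- p - lr * wd * p
p1 = p1.add(p1, alpha=-lr * wd)

# PyTorch AdamW-style multiplicative form: p <- p * (1 - lr * wd)
p2.mul_(1 - lr * wd)

print(torch.allclose(p1, p2))  # True: both apply the same decay factor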

Am I missing something needed to get from Algorithm 2 to the PyTorch implementation?

Thank you for any elaborations.

There was a discussion about this here. Is this what you were looking for?

Thank you! Your GitHub comments really helped me understand the “hidden” reasoning, though I scratched my head for hours before it clicked. To verify the implementation, line 12 can be expanded as

theta_t <- theta_{t-1} - eta_t * alpha * m_hat_t / (sqrt(v_hat_t) + epsilon) - eta_t * lambda * theta_{t-1}

group['lr'] = eta_t * alpha, because group['lr'] may come from an lr scheduler, and thus group['weight_decay'] = lambda / alpha; when the two are multiplied, the alphas cancel.

(Sorry for the ugly plain-text equations).
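If that reading is right, the cancellation can be sanity-checked numerically; the values below are made up purely for illustration:

# Quick check of the cancellation (eta_t, alpha, lambda are illustrative numbers only).
eta_t = 0.5          # schedule multiplier eta_t
alpha = 1e-3         # base learning rate alpha
lam = 1e-2           # the paper's decoupled weight decay lambda

group_lr = eta_t * alpha          # what a PyTorch lr scheduler would expose
group_weight_decay = lam / alpha  # weight decay passed to the optimizer

# Decay factor applied by p.data.mul_(1 - lr * weight_decay)
pytorch_decay = group_lr * group_weight_decay

# Decay factor prescribed by line 12 of Algorithm 2: eta_t * lambda
paper_decay = eta_t * lam

print(pytorch_decay, paper_decay)                # both 0.005
print(abs(pytorch_decay - paper_decay) < 1e-12)  # True: the alphas cancel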

So the weight decay parameter actually passed to the optimizer is the paper's lambda scaled by 1/alpha, the inverse of the initial learning rate, and we should take that into consideration when tuning models.
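If this reading is correct, reproducing a "paper-style" lambda with torch.optim.AdamW would mean dividing by the base learning rate before passing weight_decay. A hedged sketch (the lambda value and model are hypothetical):

import torch

base_lr = 1e-3       # alpha, the initial learning rate (illustrative)
lambda_paper = 1e-4  # the decoupled weight decay in the paper's notation (illustrative)

model = torch.nn.Linear(10, 1)

# Under this interpretation, divide by alpha before handing the value
# to torch.optim.AdamW's weight_decay argument.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=base_lr,
    weight_decay=lambda_paper / base_lr,
)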

Correct me if I’m wrong.