The AdamW paper says that Adam with weight decay looks like this:

And the corresponding PyTorch implementation is:

```python
# Perform stepweight decay
p.data.mul_(1 - group['lr'] * group['weight_decay'])
```

I'm stuck on how line 12 of **Algorithm 2** (AdamW) leads to the PyTorch version.
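For reference, my reading of line 12 of Algorithm 2 is (with schedule multiplier $\eta_t$, step size $\alpha$, and weight-decay coefficient $\lambda$):

$$
\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon) + \lambda\, \theta_{t-1} \right)
$$

so the weight-decay term alone updates the parameters by $\theta \leftarrow \theta - \eta_t \lambda\, \theta = (1 - \eta_t \lambda)\,\theta$, which looks like where the `mul_` form might come from.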

I googled for a while and found that fast.ai published a post, *AdamW and Super-convergence is now the fastest way to train neural nets*, which concluded that `AdamW` might be implemented in some way like:

```python
loss.backward()
# Apply decoupled weight decay before the optimizer step:
# p <- p - lr * wd * p  (wd is the weight-decay coefficient)
for group in optimizer.param_groups:
    for param in group['params']:
        param.data = param.data.add(param.data, alpha=-wd * group['lr'])
optimizer.step()
```
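To convince myself that this subtract-a-fraction form and PyTorch's `mul_` form agree, I checked the algebra numerically with plain floats (the `lr`, `wd`, and parameter values here are arbitrary, just for illustration):

```python
import math

lr, wd = 0.01, 0.1    # arbitrary learning rate and weight-decay coefficient
p = [1.0, -2.0, 3.0]  # a toy "parameter tensor"

# fast.ai-style decay step: p <- p - lr * wd * p
sub_form = [x - lr * wd * x for x in p]

# PyTorch AdamW-style decay step: p <- p * (1 - lr * wd)
mul_form = [x * (1 - lr * wd) for x in p]

# The two are the same update, just factored differently.
assert all(math.isclose(a, b) for a, b in zip(sub_form, mul_form))
```

So, as far as I can tell, the two snippets compute the same decay; the difference is only whether the common factor of `p` is pulled out.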

Am I missing a step needed to get from Algorithm 2 to the PyTorch implementation?

Thanks in advance for any clarification.