I was playing with AdamW. The default value for the weight decay (L2 regularization) is 0.01. I believe this is the wrong value: I tried the default and failed to train my model.
Some interesting points after carefully reading the AdamW paper: https://arxiv.org/pdf/1711.05101v2.pdf
For a computer vision example, the recommended value is 0.00025 (Figure 2).
It suggests that the best weight decay differs from task to task. The authors therefore propose the concept of a normalized weight decay, from which you compute the actual weight decay using formula 6. The normalized weight decay is much larger than the actual weight decay.
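As a rough sketch, formula 6 in the paper scales the normalized weight decay by the square root of the batch size over the total number of weight updates (training-set size times epochs). The example numbers below are hypothetical, chosen only to show that a normalized value around 0.05 can shrink to roughly the 0.00025 range mentioned above:

```python
import math

def actual_weight_decay(norm_wd, batch_size, num_train_points, num_epochs):
    # Formula 6 from the AdamW paper (Loshchilov & Hutter):
    #   lambda = lambda_norm * sqrt(b / (B * T))
    # where b is the batch size, B the number of training points,
    # and T the total number of epochs.
    return norm_wd * math.sqrt(batch_size / (num_train_points * num_epochs))

# Hypothetical CIFAR-10-like setup: 50k images, batch 128, 100 epochs
wd = actual_weight_decay(0.05, batch_size=128,
                         num_train_points=50_000, num_epochs=100)
print(f"{wd:.6f}")  # around 0.000253
```

Note how a small batch size or a long training run pushes the actual weight decay well below the normalized value, which is consistent with 0.01 being far too large if it was meant as a normalized number.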
I believe the 0.01 used in the PyTorch implementation of AdamW comes from the normalized weight decay; I suspect PyTorch just uses that value directly as the weight decay.
I hope this is helpful for other folks.