I found an AdamW implementation by LiyuanLucasLiu.
Comparing that implementation with Adam, there is one thing I wonder about:
Why does the AdamW implementation use p_data_fp32 = p.data.float() and later p.data.copy_(p_data_fp32)?
Is this a placeholder trick to make the optimizer more memory efficient?
Would this improve the original Adam implementation, or is it not needed there?
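
For reference, the pattern I am asking about looks roughly like the sketch below. This is not the actual code from that repository: the parameter, the gradient values, lr, and the plain SGD-style step are all made up for illustration, and only the p_data_fp32 lines mirror what I saw.

```python
import torch

# Hypothetical half-precision parameter and gradient (illustration only).
p = torch.nn.Parameter(torch.ones(4, dtype=torch.float16))
p.grad = torch.full((4,), 0.5, dtype=torch.float16)

lr = 0.1  # illustrative learning rate, not taken from the original code

# Convert gradient and parameter data to float32 before doing any arithmetic.
grad = p.grad.data.float()
p_data_fp32 = p.data.float()

# Stand-in for the real Adam/AdamW update: a plain gradient step in fp32.
p_data_fp32.add_(grad, alpha=-lr)

# Write the result back into the original tensor; copy_ casts to p's dtype.
p.data.copy_(p_data_fp32)

print(p.dtype, p_data_fp32.dtype)
```

In this sketch the parameter is fp16, so .float() yields a separate fp32 tensor and copy_ casts the result back to fp16. If the parameter were already fp32, .float() would return the same tensor rather than a copy.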