I found an AdamW implementation by LiyuanLucasLiu.
Comparing that implementation with Adam, there is one thing I wonder about.
Why does the AdamW implementation use
p_data_fp32 = p.data.float() and later on
Is this a placeholder trick to make the optimizer more memory efficient?
Would it improve the original Adam implementation, or is it not needed there?
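
To make the question concrete, here is a minimal sketch of what that line does on its own. The helper name `as_fp32` and the toy parameters are mine, not taken from the AdamW code. As far as I can tell, `.float()` hands back the same storage when the parameter is already fp32 and allocates a new fp32 copy when it is fp16, so it looks more like a precision measure for half-precision parameters than a memory saving, but I may be missing something.

```python
import torch

def as_fp32(p):
    # Hypothetical helper that mirrors the quoted line: p.data.float()
    return p.data.float()

# Toy parameters, one in full precision and one in half precision
p32 = torch.nn.Parameter(torch.ones(4, dtype=torch.float32))
p16 = torch.nn.Parameter(torch.ones(4, dtype=torch.float16))

w32 = as_fp32(p32)
w16 = as_fp32(p16)

# Compare underlying memory addresses to see whether a copy was made
shares_32 = w32.data_ptr() == p32.data_ptr()
shares_16 = w16.data_ptr() == p16.data_ptr()

print(w32.dtype, "shares storage with p32:", shares_32)
print(w16.dtype, "shares storage with p16:", shares_16)
```

If that is right, then for an fp16 parameter any update computed on the fp32 copy would have to be written back to the parameter explicitly, since the copy does not alias it.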