I found an AdamW implementation by LiyuanLucasLiu.
Comparing it with the Adam implementation, I have one question: why does the AdamW implementation use the following two lines?
p_data_fp32 = p.data.float()
and, later on,
p.data.copy_(p_data_fp32)
Is this a placeholder trick to make the optimizer more memory efficient?
Would it also improve the original Adam implementation, or is it not needed there?
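
To make the question concrete, here is a minimal sketch of what those two calls do in PyTorch when run on their own. The tensors `p` and `q` and the constant update are made up for illustration and are not the actual AdamW step. The sketch only shows the dtype behavior: `.float()` gives a separate fp32 copy for a half-precision tensor but returns the same tensor for one that is already fp32, and `copy_` writes the result back in place, casting to the parameter's dtype. That would suggest the pattern is related to handling fp16 parameters rather than saving memory, but this is an inference from how the calls behave, not something the implementation states.

```python
import torch

# Hypothetical half-precision parameter, standing in for p.data.
p = torch.ones(4, dtype=torch.float16)

# For an fp16 tensor, .float() allocates a new fp32 tensor.
p_data_fp32 = p.float()
is_separate_copy = p_data_fp32.data_ptr() != p.data_ptr()

# Do the arithmetic in fp32 (a made-up update, not the real AdamW step).
p_data_fp32.add_(-0.25)

# Write the result back in place; copy_ casts fp32 to p's dtype (fp16).
p.copy_(p_data_fp32)

# For a parameter that is already fp32, .float() returns the same tensor,
# so no extra copy is made and the later copy_ has nothing to convert.
q = torch.ones(4)
shares_storage = q.float().data_ptr() == q.data_ptr()

print(p.dtype, p_data_fp32.dtype, is_separate_copy, shares_storage)
```

If that reading is right, the extra copy only costs memory for fp16 parameters, and for plain fp32 training the two lines change nothing.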