In Adam.py, I changed
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
to
exp_avg_sq = exp_avg_sq.mul(beta2).addcmul(1 - beta2, grad, grad)
The training loss behavior is very different (the training loss increases). Is there any difference besides memory efficiency?
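For context, here is a minimal sketch of where this line sits in the step function, assuming the standard PyTorch Adam layout where exp_avg_sq is a local name bound to a tensor in the optimizer's state dict (the values for beta2 and grad are made up for illustration; the keyword form of addcmul_ is used here, since the positional value-first form in the snippet above is from older PyTorch versions):

import torch

beta2 = 0.999
grad = torch.tensor([1.0, 2.0])
state = {'exp_avg_sq': torch.zeros(2)}

exp_avg_sq = state['exp_avg_sq']  # local name bound to the state tensor

# Variant 1: in-place ops mutate the tensor that state['exp_avg_sq']
# points to, so the optimizer state itself is updated.
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
print(state['exp_avg_sq'])  # tensor([0.0010, 0.0040]) -- state updated

# Variant 2: out-of-place ops build a new tensor and rebind only the
# local name; state['exp_avg_sq'] still points at the old tensor.
exp_avg_sq = exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)
print(state['exp_avg_sq'])  # unchanged by this line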