Thanks a lot for pointing that out. Indeed, the momentum term carries the gradients from previous steps forward.
Even if we zero the gradient at step t, the update that produces the weights at step t+1 still includes the momentum accumulated up to step t, so the weights can still change.
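
To make this concrete, here is a minimal sketch in plain Python. It assumes one common formulation of SGD with momentum (v = mu * v + grad, then w = w - lr * v); the function name `sgd_momentum_step` and the values of `lr`, `mu`, and the gradients are made up for illustration and are not taken from any particular library or from the setup discussed above.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update (assumed form: v = mu*v + grad; w = w - lr*v)."""
    v_new = mu * v + grad      # velocity mixes the old velocity with the new gradient
    w_new = w - lr * v_new     # the weight moves along the velocity, not the raw gradient
    return w_new, v_new


# Illustrative starting point: scalar weight, empty momentum buffer.
w, v = 1.0, 0.0

# Step t-1: a nonzero gradient fills the momentum buffer.
w, v = sgd_momentum_step(w, v, grad=2.0)
w_before = w

# Step t: the gradient is zeroed, but the velocity is still mu * v, which is nonzero.
w, v = sgd_momentum_step(w, v, grad=0.0)

print("weight before zero-gradient step:", w_before)
print("weight after zero-gradient step: ", w)
print("remaining velocity:              ", v)
```

With a zero gradient at step t, the velocity only decays by the factor mu, so the weight keeps moving until the buffer has shrunk to a negligible size or is reset explicitly.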