They are actually equivalent when the learning rate does not change, given that the initialisation is simply v = 0.
The only difference is when the learning rate changes. In this case, you would need to rescale the momentum buffer so that
v = old_lr / new_lr * v
to get equivalent behaviour (if I'm not mistaken).
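A quick way to check this numerically is the plain-Python sketch below. It assumes the two formulations being compared are "learning rate outside the buffer" (v = mu * v + g, then p = p - lr * v) and "learning rate inside the buffer" (v = mu * v + lr * g, then p = p - v). That pairing is my reading of the question, and the function names, gradients and learning rates are all made up for illustration.

```python
def run_lr_outside(grads, lrs, mu, rescale=False):
    """v = mu*v + g ; p = p - lr*v (learning rate applied outside the buffer)."""
    p, v, prev_lr = 0.0, 0.0, None
    trajectory = []
    for g, lr in zip(grads, lrs):
        if rescale and prev_lr is not None and lr != prev_lr:
            # The correction from above: v = old_lr / new_lr * v
            v = prev_lr / lr * v
        v = mu * v + g
        p = p - lr * v
        prev_lr = lr
        trajectory.append(p)
    return trajectory


def run_lr_inside(grads, lrs, mu):
    """v = mu*v + lr*g ; p = p - v (learning rate folded into the buffer)."""
    p, v = 0.0, 0.0
    trajectory = []
    for g, lr in zip(grads, lrs):
        v = mu * v + lr * g
        p = p - v
        trajectory.append(p)
    return trajectory


def close(a, b, tol=1e-12):
    """True if the two parameter trajectories agree up to rounding error."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


# Hypothetical gradients and schedules, chosen only for illustration.
grads = [1.0, -0.5, 2.0, 0.25, -1.5, 0.75]
constant_lrs = [0.1] * 6
changing_lrs = [0.1, 0.1, 0.1, 0.01, 0.01, 0.01]
mu = 0.9

same_with_constant_lr = close(
    run_lr_outside(grads, constant_lrs, mu),
    run_lr_inside(grads, constant_lrs, mu),
)
same_without_rescale = close(
    run_lr_outside(grads, changing_lrs, mu),
    run_lr_inside(grads, changing_lrs, mu),
)
same_with_rescale = close(
    run_lr_outside(grads, changing_lrs, mu, rescale=True),
    run_lr_inside(grads, changing_lrs, mu),
)

print(same_with_constant_lr, same_without_rescale, same_with_rescale)
```

With these toy numbers the first and third comparisons should agree and the second should not, which matches the claim: the two versions only drift apart once the learning rate changes, and rescaling the buffer at that point brings them back in line.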