SGD compatibility with other frameworks

They are actually equivalent as long as the learning rate does not change, given that the momentum buffer is simply initialised to v = 0.
The only difference shows up when the learning rate changes. In that case, you would need to rescale the momentum buffer so that

v = old_lr / new_lr * v

to get equivalent behaviour (if I'm not mistaken).
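
For what it's worth, here is a small sanity check of the claim in plain Python. It is only a sketch under my own assumptions: I take the two conventions being compared to be "learning rate folded into the velocity" (v = mu * v + lr * g, then p = p - v) and "learning rate applied at the step" (v = mu * v + g, then p = p - lr * v). The function names, the scalar parameter, and the toy gradients are all made up for illustration and are not any framework's API.

```python
def run_lr_in_velocity(grads, lrs, mu=0.9, p=1.0):
    """Assumed convention A: v = mu * v + lr * g, then p = p - v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = mu * v + lr * g
        p = p - v
    return p


def run_lr_at_step(grads, lrs, mu=0.9, p=1.0, rescale=True):
    """Assumed convention B: v = mu * v + g, then p = p - lr * v.

    With rescale=True the buffer is multiplied by old_lr / new_lr
    whenever the learning rate changes, as suggested above.
    """
    v = 0.0
    old_lr = lrs[0]
    for g, new_lr in zip(grads, lrs):
        if rescale and new_lr != old_lr:
            v = old_lr / new_lr * v
        old_lr = new_lr
        v = mu * v + g
        p = p - new_lr * v
    return p


# Toy gradients and a schedule that drops the learning rate halfway (hypothetical values).
grads = [0.5, -0.2, 0.3, 0.1]
lrs = [0.1, 0.1, 0.01, 0.01]

print("A:                ", run_lr_in_velocity(grads, lrs))
print("B with rescale:   ", run_lr_at_step(grads, lrs, rescale=True))
print("B without rescale:", run_lr_at_step(grads, lrs, rescale=False))
```

With the rescaling, B should track A up to floating-point error, and without it the two drift apart as soon as the learning rate drops. With a constant schedule the rescaling branch never fires, so the two agree either way.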