Hi,
I am writing a custom optimizer that needs to access the global (all-reduced) gradient from the previous iteration, and also keep its parameters in sync across optimizer instances on different ranks.
What would be the best way to do this?
In a non-distributed context, I simply stored the previous grads in the optimizer state, the same way Adam does with state['exp_avg_sq'].
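For context, here is roughly what I have now in the single-process case (a minimal sketch; the name PrevGradSGD and the update rule are just placeholders, the point is stashing the previous gradient in state the way Adam stashes its moments):

```python
import torch
from torch.optim import Optimizer

class PrevGradSGD(Optimizer):
    """Toy SGD variant that keeps the previous iteration's gradient
    in per-parameter state, mirroring Adam's state['exp_avg_sq']."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "prev_grad" not in state:
                    state["prev_grad"] = torch.zeros_like(p)
                # ...here the real update would combine p.grad with
                # state["prev_grad"]; plain SGD shown as a stand-in...
                p.add_(p.grad, alpha=-group["lr"])
                # stash the current grad for the next iteration
                state["prev_grad"].copy_(p.grad)
```

This works fine on one process, but in the distributed case state["prev_grad"] only ever sees the local gradient, not the globally reduced one.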
Should I be calling my own all_reduce during the training loop?
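i.e., something like this between loss.backward() and optimizer.step()? (A sketch, assuming the default process group is already initialized; average_grads is just a name I made up.)

```python
import torch
import torch.distributed as dist

def average_grads(params):
    """Sum each parameter's gradient across all ranks with all_reduce,
    then divide by world size so every rank holds the averaged grad.
    Assumes dist.init_process_group() has already been called."""
    world = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)
```

Or is there a more idiomatic hook for this that plays nicely with DDP's own gradient reduction?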
Thanks!