Question about ZeRO optimizer

I’m using ZeroRedundancyOptimizer in torch 1.8, and I notice that in the step function of ZeRO there is an update_param_groups call before self.optim.step().

I wonder whether this function broadcasts the gradients that self.optim uses to compute the new parameters. I don’t see any docstring explaining where the gradients go.

Looking forward to any replies!

Hmm, I found the key: DDP has already done the parameter and gradient communication before the optimizer runs.

Exactly. Regardless of the optimizer, by the time step() is called it is guaranteed that all gradient communication has taken place and each replica holds the allreduced gradients in the parameters’ .grad fields.
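
To make the data flow concrete, here is a minimal sketch of a DDP + ZeroRedundancyOptimizer training step, not the library internals. It assumes the process group is already initialized (e.g. via torchrun) and a CUDA device per rank; the `optimizer_class` keyword is the current API name and may be spelled differently in the 1.8 release.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def train_step(rank: int):
    # Wrap the model in DDP; DDP registers hooks that allreduce gradients
    # during backward(), so .grad is synchronized across ranks afterwards.
    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])

    # Each rank keeps optimizer state only for its own shard of parameters.
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.SGD,  # local optimizer wrapped by ZeRO
        lr=0.01,
    )

    inputs = torch.randn(20, 10, device=rank)
    loss = model(inputs).sum()

    # Gradient communication happens here (DDP allreduce), not in step():
    # when backward() returns, every rank sees the averaged gradients.
    loss.backward()

    # step() runs the local optimizer on this rank's shard only, then
    # broadcasts the updated parameters so all replicas stay identical.
    optimizer.step()
    optimizer.zero_grad()
```

So the sharded optimizer never needs to move gradients itself; it only needs to broadcast the updated parameters for its shard after the local update.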