Question about ZeRO optimizer

I’m using ZeroRedundancyOptimizer in torch 1.8, and I notice that in the step function of ZeRO there is an update_param_groups call before self.optim.step().

I wonder whether this function broadcasts the gradients that self.optim uses to compute the new parameters. I don’t see any docstring explaining where the gradients go.

Looking forward to any replies!

Hmm, I found the key: DDP has already done the parameter and gradient communication before the optimizer runs.

Exactly. Regardless of the optimizer, by the time step() is called it is guaranteed that all gradient communication has taken place and each replica holds the allreduced gradients in the parameters’ .grad fields.
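
To make the data flow concrete, here is a minimal sketch of a DDP + ZeroRedundancyOptimizer training step, not the library internals. It assumes the process group is already initialized (e.g. via torchrun) and a CUDA device per rank; the `optimizer_class` keyword is the current API name and may be spelled differently in the 1.8 release.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def train_step(rank: int):
    # Wrap the model in DDP; DDP registers hooks that allreduce gradients
    # during backward(), so .grad is synchronized across ranks afterwards.
    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])

    # Each rank keeps optimizer state only for its own shard of parameters.
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.SGD,  # local optimizer wrapped by ZeRO
        lr=0.01,
    )

    inputs = torch.randn(20, 10, device=rank)
    loss = model(inputs).sum()

    # Gradient communication happens here (DDP allreduce), not in step():
    # when backward() returns, every rank sees the averaged gradients.
    loss.backward()

    # step() runs the local optimizer on this rank's shard only, then
    # broadcasts the updated parameters so all replicas stay identical.
    optimizer.step()
    optimizer.zero_grad()
```

So the sharded optimizer never needs to move gradients itself; it only needs to broadcast the updated parameters for its shard after the local update.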