In DDP training, how to do gradient synchronization among part of DDP nodes?

Asta · August 18, 2020, 6:41am

As far as I know, in DDP (DistributedDataParallel), loss.backward() automatically synchronizes gradients across all nodes in the group through the Reducer. However, if I want to synchronize gradients and update model parameters among only a subset of the nodes in some epochs, how can I do that?

I would appreciate any hints or a concrete code sample.

mrshenli · August 18, 2020, 6:24pm

Hey @Asta

I see two options:

Option 1: create two DDP instances on each process and construct them using different ProcessGroup instances. One DDP instance can use the global ProcessGroup, which synchronizes across all nodes, and the other can use a ProcessGroup for a sub-group, created using the new_group API.
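A rough sketch of Option 1, not a tested recipe. The helper name `build_ddp_wrappers` and the `subgroup_ranks` argument are made up for illustration. It assumes the default process group is already initialized, and that a DDP wrapper whose forward was not called in an iteration does not reduce gradients in that iteration, which is worth verifying on your PyTorch version.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def build_ddp_wrappers(model, subgroup_ranks):
    """Wrap the same module in a global DDP and a sub-group DDP (hypothetical helper)."""
    # new_group must be called by every process, even those not in the sub-group.
    subgroup = dist.new_group(ranks=subgroup_ranks)

    # No process_group argument: uses the default (global) group.
    ddp_global = DDP(model)

    # Only members of the sub-group get a second wrapper.
    ddp_sub = None
    if dist.get_rank() in subgroup_ranks:
        ddp_sub = DDP(model, process_group=subgroup)

    return ddp_global, ddp_sub
```

In the training loop you would then run the forward pass through `ddp_global` in iterations that should sync across all nodes and through `ddp_sub` in iterations that should sync only within the sub-group. Processes outside the sub-group could call the plain `model` in those iterations so that no communication happens.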

Option 2: Use the DDP comm hook [code and example]. This is still a prototype feature and might change in the future.
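A minimal sketch of what Option 2 could look like, written against the comm hook API as it exists in more recent PyTorch releases (`register_comm_hook`, `bucket.buffer()`). Since the feature was a prototype at the time of this post, names may differ on your version. `PartialSyncState`, `partial_allreduce_hook`, and the `use_subgroup` flag are hypothetical names, not part of PyTorch.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class PartialSyncState:
    """Hypothetical state object handed to the hook on every bucket."""

    def __init__(self, subgroup, subgroup_ranks):
        self.subgroup = subgroup
        self.subgroup_ranks = list(subgroup_ranks)
        self.use_subgroup = False  # flip this in the epochs that should sync partially


def partial_allreduce_hook(state, bucket):
    buf = bucket.buffer()  # flattened gradients of this bucket
    if state.use_subgroup:
        if dist.get_rank() not in state.subgroup_ranks:
            # Non-members skip communication and keep their local gradients.
            fut = torch.futures.Future()
            fut.set_result(buf)
            return fut
        group, size = state.subgroup, len(state.subgroup_ranks)
    else:
        group, size = dist.group.WORLD, dist.get_world_size()

    # With a hook registered, averaging is the hook's job.
    buf.div_(size)
    work = dist.all_reduce(buf, group=group, async_op=True)
    return work.get_future().then(lambda f: f.value()[0])
```

Usage would be roughly `ddp_model.register_comm_hook(state, partial_allreduce_hook)` once after constructing the DDP model, then setting `state.use_subgroup` at the start of each epoch. The sub-group itself still has to be created with `dist.new_group` on every process.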

One thing to mention is that when you do this (sync gradients in a subgroup), it will create gradient inconsistency across processes, because some processes did not participate in some iterations. This would then lead to inconsistency among the model replicas on different processes. DDP only broadcasts the model in its constructor. To keep all model replicas consistent, it relies on the assumption that all processes see the same gradients in every iteration. So, if you do a partial sync, you might also need to manually broadcast the model to bring all processes back in sync; otherwise the result might be numerically incorrect.
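The manual broadcast mentioned above could look something like this. `resync_model` is a hypothetical helper, and picking rank 0 as the source of truth is an assumption; which replica you treat as authoritative depends on your setup.

```python
import torch
import torch.distributed as dist


def resync_model(model, src=0, group=None):
    """Overwrite every replica's parameters and buffers with those of rank `src`."""
    # state_dict() values are detached tensors that share storage with the
    # module, so the in-place broadcast updates the module itself.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src, group=group)
```

Every process in the group has to make this call, for example at the end of each epoch that used partial sync. Note that this only covers the model; optimizer state such as momentum buffers is not touched by this sketch.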