Does DDP tolerate heterogeneity?

I’m currently implementing a heterogeneity-aware distributed model.
My basic idea is to do the all-reduce over a subset of fast workers in the world group. However, I noticed that the documentation for torch.distributed.new_group says:

This function requires that all processes in the main group (i.e. all processes that are part of the distributed job) enter this function, even if they are not going to be members of the group.

Does that mean that the fast workers have to wait for the slow workers before they can move forward (i.e., the slow workers will hold back the fast ones)?
If so, is there any other way to implement the model?

Does that mean that the fast workers have to wait for the slow workers before they can move forward (i.e., the slow workers will hold back the fast ones)?

No. It only requires all processes to call that function for rendezvous. After that, collective communications (e.g., all-reduce) within a subgroup do not require non-member processes to join, so different subgroups can run all-reduce independently.
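
For example, here is a minimal sketch of that behavior, assuming a four-process job, the gloo backend, and ranks 0 and 1 as the "fast" subgroup (all of these are illustrative assumptions, not part of your setup):

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # e.g. launched with torchrun --nproc_per_node=4
rank = dist.get_rank()

FAST_RANKS = [0, 1]  # example choice of "fast" workers

# Every rank in the job must enter new_group (rendezvous),
# even the slow ranks that will never be members of the subgroup.
fast_group = dist.new_group(ranks=FAST_RANKS)

t = torch.ones(1) * rank
if rank in FAST_RANKS:
    # This collective only involves the subgroup members;
    # ranks outside FAST_RANKS skip it and are not blocked by it.
    dist.all_reduce(t, op=dist.ReduceOp.SUM, group=fast_group)
```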

For the implementation, you could create multiple DDP gangs on the same model, with each gang spanning a different set of processes. But then you will need to coordinate the communication, because different DDP gangs will read from and write to the same set of param.grad fields; the application needs to avoid the race there.
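
To make the process_group plumbing concrete, here is a minimal sketch with one DDP gang per disjoint subgroup; the rank split, backend, and toy model are assumptions for illustration. If a rank belonged to more than one gang (i.e., several DDP wrappers over the same local module), those wrappers would share the same param.grad tensors, which is the race mentioned above.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # e.g. torchrun --nproc_per_node=4 this_script.py
rank = dist.get_rank()

fast_ranks, slow_ranks = [0, 1], [2, 3]
# All ranks must enter both new_group calls, regardless of membership.
fast_group = dist.new_group(ranks=fast_ranks)
slow_group = dist.new_group(ranks=slow_ranks)
my_group = fast_group if rank in fast_ranks else slow_group

model = nn.Linear(10, 10)
# Gradients are all-reduced only within this rank's gang. If the same local
# module were wrapped by several gangs, the application would have to make
# sure only one of them syncs param.grad at a time.
ddp_model = DDP(model, process_group=my_group)

x = torch.randn(8, 10)
ddp_model.zero_grad()
ddp_model(x).sum().backward()  # triggers the gang-local gradient all-reduce
```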