Disabling all reduce in Distributed Data Parallel

Hello, I’m trying to set up distributed model training. The Distributed Data Parallel documentation says that torch.nn.parallel.DistributedDataParallel performs the allreduce operation by itself, if I understood it correctly. Is it possible to disable this functionality so I can call allreduce manually? Or do I have to use something instead of DistributedDataParallel in this case?

Is it possible to disable this functionality so I can call all reduce manually?

Do you need to implement any customized logic in the allreduce? If so, I would recommend DDP comm hooks, which provide an interface for implementing a custom allreduce.
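For reference, here is a minimal sketch of registering a comm hook that reimplements a plain averaging allreduce; the wrapped model name `ddp_model` and the `local_rank` variable are assumptions, and you would replace the hook body with your own logic:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def custom_allreduce_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    # Sum the bucket's flattened gradients across all ranks asynchronously,
    # then divide by the world size to get the average.
    tensor = bucket.buffer()
    fut = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True).get_future()

    def average(fut):
        return fut.value()[0] / dist.get_world_size()

    return fut.then(average)

# Assuming the model is already wrapped, e.g.:
# ddp_model = DDP(model, device_ids=[local_rank])
# ddp_model.register_comm_hook(state=None, hook=custom_allreduce_hook)
```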

Another option is the no_sync context manager, which disables the automatic allreduce; it then becomes your responsibility to run allreduce yourself.
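A minimal sketch of that pattern, assuming `ddp_model`, `optimizer`, and `inputs` already exist (hypothetical names):

```python
import torch.distributed as dist

# Backward inside no_sync() accumulates gradients locally; DDP skips its allreduce.
with ddp_model.no_sync():
    loss = ddp_model(inputs).sum()
    loss.backward()

# Manually allreduce and average the local gradients.
for param in ddp_model.parameters():
    if param.grad is not None:
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= dist.get_world_size()

optimizer.step()
optimizer.zero_grad()
```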

Seems like no_sync is what I need. Thank you!