Ideally, we should address this in DDP and close https://github.com/pytorch/pytorch/issues/38174. Before that takes place, you can use all_reduce to
synchronize a signal across all processes. See Multiprocessing - Barrier Blocks all Processes?
One thing to note is that this might have a performance impact, especially when the model is light and its forward pass runs faster than communicating the signal.
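A rough sketch of what that could look like (assuming the default process group is already initialized, and with `model`, `data_loader`, and `optimizer` as placeholder names): each rank all_reduces a small "out of data" flag every iteration, so all ranks exit the loop together once any rank's iterator is exhausted, instead of hanging in DDP's gradient allreduce.

```python
import torch
import torch.distributed as dist

def train(model, data_loader, optimizer, device):
    data_iter = iter(data_loader)
    while True:
        try:
            batch = next(data_iter)
            exhausted = torch.zeros(1, device=device)
        except StopIteration:
            batch = None
            exhausted = torch.ones(1, device=device)

        # Synchronize the signal: if any rank ran out of data, the sum is > 0
        # and every rank breaks on the same iteration.
        dist.all_reduce(exhausted, op=dist.ReduceOp.SUM)
        if exhausted.item() > 0:
            break

        optimizer.zero_grad()
        loss = model(batch).sum()  # placeholder loss computation
        loss.backward()
        optimizer.step()
```

This extra all_reduce per iteration is exactly where the overhead mentioned above comes from.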