Ideally, we should address this in DDP and close https://github.com/pytorch/pytorch/issues/38174. Before that takes place, you can use all_reduce to
synchronize a signal across all processes. See Multiprocessing - Barrier Blocks all Processes?
One thing to note is that this might have a performance impact, especially when the model is light and its forward pass runs faster than communicating the signal.
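A rough sketch of what that could look like (assuming the default process group is already initialized, and with `model`, `data_loader`, and `optimizer` as placeholder names): each rank all_reduces a small "out of data" flag every iteration, so all ranks exit the loop together once any rank's iterator is exhausted, instead of hanging in DDP's gradient allreduce.

```python
import torch
import torch.distributed as dist

def train(model, data_loader, optimizer, device):
    data_iter = iter(data_loader)
    while True:
        try:
            batch = next(data_iter)
            exhausted = torch.zeros(1, device=device)
        except StopIteration:
            batch = None
            exhausted = torch.ones(1, device=device)

        # Synchronize the signal: if any rank ran out of data, the sum is > 0
        # and every rank breaks on the same iteration.
        dist.all_reduce(exhausted, op=dist.ReduceOp.SUM)
        if exhausted.item() > 0:
            break

        optimizer.zero_grad()
        loss = model(batch).sum()  # placeholder loss computation
        loss.backward()
        optimizer.step()
```

This extra all_reduce per iteration is exactly where the overhead mentioned above comes from.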