I want to run a test to see how synchronization works in DDP. My assumption is that in each batch, DDP waits for all world_size processes to reach a synchronization point, such as the gradient synchronization in the backward pass. If only 2 of the GPU processes are started, I would expect the synchronization on those 2 processes to time out at the end of the first batch, because the other two never started. What I observed instead is that training continued with only 2 GPU processes. How can this be explained? Is my understanding incorrect?
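
To make the expectation above concrete, here is a toy model of it using only Python's standard library. It is not DDP and does not call `torch.distributed`; it only sketches the mental model of "everyone waits at a sync point, and a missing participant causes a timeout". The names `run_ranks`, `world_size`, `started`, and `timeout_s` are hypothetical, and the value 4 is an assumption inferred from "the other two".

```python
import threading


def run_ranks(world_size, started, timeout_s):
    """Toy model: `started` threads wait at a barrier sized for `world_size`.

    Returns a dict mapping each started rank to "synced" or "timed out".
    This is an analogy for the expected behavior, not how DDP is implemented.
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        try:
            # Stand-in for the per-batch synchronization point.
            barrier.wait(timeout=timeout_s)
            results[rank] = "synced"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or the barrier is broken
            # by another waiter's timeout.
            results[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Assumed setup: 4 expected ranks, only 2 launched.
    print(run_ranks(world_size=4, started=2, timeout_s=0.5))
```

In this toy model both launched ranks time out, which is the behavior I expected from DDP. The observed behavior (training continuing) is what does not match.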