I want to run a test to see how the synchronization works. I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass, to synchronize gradients. If only 2 of the GPU processes start, I assume that at the end of the first batch, the synchronization on the existing 2 GPUs would time out, because the other two never started. What I observed instead is that training continued with only 2 GPU processes. How can this be explained? Is my understanding incorrect?
Sorry about the delay.
I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass, to synchronize gradients. If only 2 of the GPU processes start, I assume that at the end of the first batch, the synchronization on the existing 2 GPUs would time out, because the other two never started.
Yes, this is correct.
What I observed instead is that training continued with only 2 GPU processes. How can this be explained? Is my understanding incorrect?
It should block either in the DDP constructor or in the backward call. Could you please share a code snippet that reproduces the behavior above?