Training with DDP and world_size 4 on a multi-GPU machine runs with only two GPUs in use

I want to run a test to see how synchronization works. My assumption is that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass where gradients are synchronized. If only 2 of the GPU processes start, I would expect the synchronization on the existing 2 GPUs to time out at the end of the first batch, since the other two never started. What I observed instead is that training continued with only 2 GPU processes. How can this be explained? Is my understanding incorrect?

Sorry about the delay.

My assumption is that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass where gradients are synchronized. If only 2 of the GPU processes start, I would expect the synchronization on the existing 2 GPUs to time out at the end of the first batch, since the other two never started.

Yes, this is correct.

What I observed instead is that training continued with only 2 GPU processes. How can this be explained? Is my understanding incorrect?

It should block either during DDP construction or in the backward call. Could you please share a code snippet that reproduces this behavior?
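
For reference, here is a minimal sketch of the kind of setup described above, not your actual code: it assumes the gloo backend on CPU, mp.spawn, a toy nn.Linear model, and an arbitrary MASTER_ADDR/MASTER_PORT, and it deliberately launches only 2 of the 4 expected ranks. With this setup, the hang (and eventual timeout error) should show up in init_process_group / DDP construction or at the first backward, rather than training proceeding with 2 ranks.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

WORLD_SIZE = 4      # what the process group is told to expect
SPAWNED_RANKS = 2   # what is actually launched

def worker(rank):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # The rendezvous waits for all WORLD_SIZE ranks; with only 2 ranks joining,
    # this should hang and eventually error out once `timeout` expires.
    dist.init_process_group(
        backend="gloo",
        rank=rank,
        world_size=WORLD_SIZE,
        timeout=timedelta(seconds=60),
    )
    # DDP construction broadcasts the initial model state, which is also a
    # collective operation and therefore needs all ranks.
    model = DDP(nn.Linear(10, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(5):
        out = model(torch.randn(8, 10))
        # The gradient all-reduce during backward is the per-batch sync point.
        out.sum().backward()
        opt.step()
        opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    # Only 2 of the 4 expected ranks are started; ranks 2 and 3 never exist.
    mp.spawn(worker, nprocs=SPAWNED_RANKS)
```

Note that the default collective timeout is long (on the order of 30 minutes), so shortening it as in the sketch makes a missing-rank hang surface quickly instead of looking like silent progress. If your launcher looks roughly like this but training still proceeds with 2 processes, it would help to see what world_size each running rank actually received.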