How to confirm DDP collects all GPU card (rank) results at each epoch

I ran the example from "What is Distributed Data Parallel (DDP) — PyTorch Tutorials 2.1.0+cu121 documentation" on a single node with 4 GPU cards. Based on the video in that example, I expected to see output from every rank at each epoch. However, each GPU prints all of its epochs in one sequential block.
That is, for 10 epochs it prints the following:

[GPU3] Epoch 0
[GPU3] Epoch 1
:
[GPU3] Epoch 8
[GPU3] Epoch 9
[GPU1] Epoch 0
[GPU1] Epoch 1
:
[GPU1] Epoch 8
[GPU1] Epoch 9
[GPU2] Epoch 0
[GPU2] Epoch 1
:
[GPU2] Epoch 8
[GPU2] Epoch 9
[GPU0] Epoch 0
[GPU0] Epoch 1
:
[GPU0] Epoch 8
[GPU0] Epoch 9
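
For reference, the per-epoch print is roughly like the sketch below (the helper name log_epoch and flush=True are my additions, not the tutorial's exact code); the flush is there to rule out block-buffered stdout in the spawned worker processes, which can hold a rank's output back until the process exits.

def log_epoch(gpu_id: int, epoch: int, b_sz: int, steps: int) -> None:
    # flush=True pushes each line to the terminal immediately; without it,
    # block-buffered stdout in a spawned worker may only be flushed at exit,
    # which would group all of that rank's lines together.
    print(f"[GPU{gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {steps}",
          flush=True)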

Is there any way to confirm that DDP correctly takes the results from all GPU cards (ranks) into account?
The environment is Python 3.11.2, torch 2.0.1, CUDA 11.7.
Thank you for your help.
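
One possible check (a sketch, assuming one process per GPU where the rank equals the CUDA device index, as in the tutorial's single-node setup): right after loss.backward(), DDP should already have all-reduced the gradients, so a scalar fingerprint of the gradients gathered from every rank should be identical.

import torch
import torch.distributed as dist

def check_grads_synced(model: torch.nn.Module, rank: int, world_size: int) -> None:
    # Collapse all gradients on this rank into a single scalar fingerprint.
    fingerprint = torch.zeros(1, device=f"cuda:{rank}")
    for p in model.parameters():
        if p.grad is not None:
            fingerprint += p.grad.detach().float().sum()
    # Collect the fingerprint from every rank and compare on rank 0.
    gathered = [torch.zeros_like(fingerprint) for _ in range(world_size)]
    dist.all_gather(gathered, fingerprint)
    if rank == 0:
        values = [t.item() for t in gathered]
        print(f"grad fingerprint per rank: {values}", flush=True)

Calling this after loss.backward() in each batch (or once per epoch) should print the same fingerprint for every rank if DDP is averaging the gradients across all GPUs.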

When I print the model parameter values after optimizer.step() in each batch, all ranks print the same values, so DDP's ring all-reduce seems to be working. However, it is still not clear why each rank prints all of its epochs sequentially.
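
For concreteness, that parameter check was along the lines of the sketch below (model, gpu_id, and epoch are names assumed from the tutorial's Trainer, not the exact code):

import torch

def print_param_slice(model: torch.nn.Module, gpu_id: int, epoch: int) -> None:
    # After optimizer.step(), every rank should report the same numbers here
    # if DDP synchronized the gradients before the update.
    values = next(model.parameters()).detach().flatten()[:3].tolist()
    print(f"[GPU{gpu_id}] Epoch {epoch} first-param slice: {values}", flush=True)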