I ran the example from "What is Distributed Data Parallel (DDP) — PyTorch Tutorials 2.1.0+cu121 documentation" on a single node with 4 GPU cards. Based on the video in the tutorial, I expected all ranks to run together at each epoch. However, the output shows each GPU completing all of its epochs sequentially.
i.e., for 10 epochs it prints the following:
[GPU3] Epoch 0
[GPU3] Epoch 1
:
[GPU3] Epoch 8
[GPU3] Epoch 9
[GPU1] Epoch 0
[GPU1] Epoch 1
:
[GPU1] Epoch 8
[GPU1] Epoch 9
[GPU2] Epoch 0
[GPU2] Epoch 1
:
[GPU2] Epoch 8
[GPU2] Epoch 9
[GPU0] Epoch 0
[GPU0] Epoch 1
:
[GPU0] Epoch 8
[GPU0] Epoch 9
Is there any way to confirm that DDP is correctly taking the results from all GPU cards into account, for example with a check like the one sketched below?
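To make the question concrete, this is the kind of verification I have in mind (a minimal sketch; `check_params_in_sync` and `ddp_model` are my own names, and I assume the process group has already been initialized by the tutorial's setup code):

```python
import torch
import torch.distributed as dist

def check_params_in_sync(ddp_model) -> bool:
    """Return True if this rank's parameters match rank 0's parameters."""
    in_sync = True
    for param in ddp_model.module.parameters():
        # Copy rank 0's version of this parameter into a reference tensor,
        # then compare it with the local copy on this rank.
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        if not torch.allclose(param.detach(), reference):
            in_sync = False
    return in_sync

# e.g. at the end of each epoch:
# print(f"[GPU{dist.get_rank()}] params in sync with rank 0: "
#       f"{check_params_in_sync(ddp_model)}", flush=True)
```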
The Python environment is Python 3.11.2, torch 2.0.1, CUDA 11.7.
Thank you for your help.