I am running training with `torchrun --nnodes 1 --nproc_per_node 2 training.py`. The steps within each epoch are parallelized and finish in a shorter time, as expected. However, there is a delay of about 20 minutes before training proceeds to the next epoch!
When running on a single GPU (without torchrun), I don't get this delay.
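One way to narrow down where those 20 minutes go is to time the epoch body separately from the gap between epochs. The sketch below uses a hypothetical `timed_training` wrapper and a `train_one_epoch` callback standing in for your actual loop; it only measures, it doesn't change training behavior:

```python
import time

def timed_training(num_epochs, train_one_epoch):
    """Run training while measuring each epoch's duration and the gap
    between consecutive epochs, to localize where a stall happens
    (e.g. epoch-boundary work such as DataLoader worker respawning)."""
    epoch_times, gaps = [], []
    prev_end = None
    for epoch in range(num_epochs):
        start = time.time()
        if prev_end is not None:
            # Time spent between the end of the previous epoch and the
            # start of this one -- a large value here points at
            # epoch-boundary overhead rather than the training steps.
            gaps.append(start - prev_end)
        train_one_epoch(epoch)
        prev_end = time.time()
        epoch_times.append(prev_end - start)
    return epoch_times, gaps
```

If the inter-epoch gap dominates and you use a `DataLoader` with `num_workers > 0`, passing `persistent_workers=True` keeps the worker processes alive across epochs, which often removes stalls at epoch boundaries.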
@omarmetrx
If you're using a codebase written partially in PyTorch, you might try raising an issue on its GitHub page, as the maintainers will be most familiar with their library.
One thing you might check is whether you're using the same library versions the authors used when they developed the project, including cuDNN and CUDA.
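As a quick way to compare your environment against the project's stated requirements, you can print the versions your PyTorch build reports (a minimal sketch; the CUDA and cuDNN fields are `None` on CPU-only builds):

```python
import torch

# Versions relevant to reproducing the authors' environment.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
```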