I am running training with `torchrun --nnodes 1 --nproc_per_node 2 training.py`. The steps within each epoch are parallelized and finish in a shorter time, as expected. However, there is a delay of about 20 minutes before training proceeds to the next epoch!
When running on a single GPU (without torchrun), I don't get this delay.
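One way to narrow down where those 20 minutes go is to time the epoch body separately from the gap between epochs. The sketch below uses a hypothetical `timed_training` wrapper and a `train_one_epoch` callback standing in for your actual loop; it only measures, it doesn't change training behavior:

```python
import time

def timed_training(num_epochs, train_one_epoch):
    """Run training while measuring each epoch's duration and the gap
    between consecutive epochs, to localize where a stall happens
    (e.g. epoch-boundary work such as DataLoader worker respawning)."""
    epoch_times, gaps = [], []
    prev_end = None
    for epoch in range(num_epochs):
        start = time.time()
        if prev_end is not None:
            # Time spent between the end of the previous epoch and the
            # start of this one -- a large value here points at
            # epoch-boundary overhead rather than the training steps.
            gaps.append(start - prev_end)
        train_one_epoch(epoch)
        prev_end = time.time()
        epoch_times.append(prev_end - start)
    return epoch_times, gaps
```

If the inter-epoch gap dominates and you use a `DataLoader` with `num_workers > 0`, passing `persistent_workers=True` keeps the worker processes alive across epochs, which often removes stalls at epoch boundaries.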
@omarmetrx
If you're using a codebase written partially in PyTorch, you might try raising an issue on its GitHub page, as the maintainers will be most familiar with their library.
One thing you might check is whether you're using the same library versions the authors used when they developed the project, including cuDNN and CUDA.
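As a quick way to compare your environment against the project's stated requirements, you can print the versions your PyTorch build reports (a minimal sketch; the CUDA and cuDNN fields are `None` on CPU-only builds):

```python
import torch

# Versions relevant to reproducing the authors' environment.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
```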