Training hangs for a second at the beginning of each epoch

I used `watch -n0.1 nvidia-smi` to monitor the GPUs and found that GPU utilization drops to 0% for a short period at the beginning of each epoch. Is it common for training to hang for a second before each epoch starts? Maybe the reason is that the dataloader has to re-prepare data at the beginning of each epoch?


That’s likely, yes. You can add some timing code in the body of your trainer to confirm this. For example, printing a timestamp as soon as you receive the first batch of input data will prove or disprove that the data loader is the cause of the slow start.
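
For instance, here is a minimal sketch of what that timing code could look like, assuming a standard PyTorch `DataLoader`; the toy dataset and loader settings are placeholders to make it self-contained:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset and loader; substitute your own.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for epoch in range(3):
    epoch_start = time.perf_counter()
    for i, (inputs, targets) in enumerate(loader):
        if i == 0:
            # Time from the start of the epoch until the first batch arrives.
            # A large value here points at DataLoader startup rather than
            # the model or the GPU.
            print(f"epoch {epoch}: first batch after "
                  f"{time.perf_counter() - epoch_start:.3f}s")
        # ... forward/backward/optimizer step would go here ...
```

If the first-batch time is large while later batches arrive quickly, the data loader is the likely cause of the pause at each epoch boundary.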

Thanks Pieter. I can confirm there is some slowdown at the beginning of each epoch. I am just wondering what causes such a delay.