I have been experiencing this problem for a while, and it only happens with fairly large datasets (more than about 1M samples).
Training during the first epoch is about 5~6 times slower than in the following epochs, and it is not just the first few iterations (a cold start). The speed gradually improves, but averaged over the whole first epoch it is about 5~6 times slower.
Do any of you have an idea of what could be happening?
I am using a data loader based on image list files:
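Roughly like this (a simplified sketch; the exact list-file format, one image path plus an integer label per line, is an assumption):

```python
from torch.utils.data import Dataset
from PIL import Image

class ImageListDataset(Dataset):
    """Loads samples listed one per line in a plain-text file."""

    def __init__(self, list_file, transform=None):
        # Assumed line format: "<image/path> <integer label>"
        with open(list_file) as f:
            self.samples = [line.split() for line in f if line.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        # Every __getitem__ reads the image file from disk
        img = Image.open(path).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, int(label)
```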
I think so, too. I just don’t know how to fix this…
Further evidence: when I start another training instance (using the same dataset) on the other GPU, the one that was already running is affected and slows down.
During the first epoch, batch loading alternates between 10-20 seconds and 1-4 seconds per batch. In the next epoch it holds steady at 0.3 seconds per batch, which seems normal.
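These numbers come from timing the loader loop, roughly like this (the batch size and worker count here are placeholders):

```python
import time
from torch.utils.data import DataLoader, Dataset

def time_batches(dataset: Dataset, batch_size: int = 64, num_workers: int = 4):
    """Print how long the DataLoader takes to produce each batch."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    end = time.time()
    for i, _batch in enumerate(loader):
        data_time = time.time() - end  # time spent waiting on the loader
        print(f"batch {i}: {data_time:.2f}s")
        # ... forward/backward pass would go here ...
        end = time.time()
```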
Increasing the number of workers did not help, so I will stick with your second solution. By the way, could you give me some insight into why this occurs only during the first epoch and not in the following ones?
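For reference, these are the loader settings I experimented with (the batch size and worker counts shown are illustrative; num_workers was the main knob I varied):

```python
from torch.utils.data import DataLoader

# `dataset` is the image-list dataset from earlier in the thread.
# pin_memory and persistent_workers are standard throughput settings.
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
)
```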
I am experiencing the same issue. When I switch to validation and then back, the first few hundred iterations are always very slow, about 4~5 times slower than normal. Some example statistics from the training process:
Hi, I solved this problem by uploading the data to the temporary folder in Colab (/content/) instead of reading it from Google Drive, which is extremely slow during the first epoch. Below is the difference between reading from the temporary folder (/content/) and Google Drive.
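The one-time copy from the mounted Drive to Colab's local disk looked roughly like this (the dataset path is a placeholder):

```python
from google.colab import drive
import shutil

drive.mount("/content/drive")  # makes Drive visible under /content/drive

# One-time copy to local disk; adjust the source path to your dataset.
shutil.copytree("/content/drive/MyDrive/dataset", "/content/dataset")
```

After that, pointing the data loader at /content/dataset means every epoch reads from local disk instead of going through the Drive mount.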