Drastically slow speed at first epoch

Hi,

I have been experiencing this problem for a while, and it only happens with fairly large datasets (more than 1M samples).
Training during the first epoch is about 5~6 times slower than in later epochs, and it is not just the first few iterations (cold start). The speed gradually improves over the epoch, but on average the first epoch is about 5~6 times slower.

Do any of you have an idea of what could be happening?
I am using a data loader based on image list files.

Thanks a lot in advance.

It could be a disk I/O problem. Could you measure how long it takes to load each batch?
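A minimal sketch of how the loading time could be measured, assuming the loader is any Python iterable (for example a PyTorch `DataLoader`). The names `timed_batches` and `summarize` are made up for illustration and are not part of any library:

```python
import time


def timed_batches(loader):
    """Yield (batch, seconds_waited) for every item produced by `loader`.

    `seconds_waited` is how long the loop blocked waiting for the next
    batch, so it excludes the time spent in the training step itself.
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def summarize(wait_times):
    """Return (mean, max) of the recorded waits in seconds."""
    if not wait_times:
        return 0.0, 0.0
    return sum(wait_times) / len(wait_times), max(wait_times)


# Hypothetical usage inside a training loop:
#
#   waits = []
#   for batch, waited in timed_batches(train_loader):
#       waits.append(waited)
#       ...  # forward / backward pass
#   print("mean / max wait per batch (s):", summarize(waits))
```

If the waits are large compared with the time of one training step, the loop is probably starved by data loading rather than limited by the GPU.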

I think so, too. I just don’t know how to fix this…

Another piece of evidence: when I start a second training instance (using the same dataset) on the other GPU, the instance that was already running is affected and slows down.

During the first epoch, the time to load each batch alternates between 10-20 seconds and 1-4 seconds. In the next epoch, it stays at about 0.3 seconds per batch, which seems normal.

A couple of ideas:

  1. If the data is on an HDD, maybe defragment it?
  2. Move the dataset to a drive that no other process uses.
  3. Increase num_workers.

Thanks for the help! I will try those.

Increasing the number of workers did not help. I will stick with your second suggestion. By the way, could you give me some insight into why this happens only during the first epoch and not in later ones?

Sorry, I’m not certain what exactly happens either.

I am experiencing the same issue. When I switch to validation and then back to training, the first few hundred iterations are always very slow, about 4~5 times slower than normal. Some example statistics from the training process:

2018-10-17 20:12:12 Validation: max_iter: 1725, loss:1.191111, Validation cer: 0.065339, accuracyCTC: 0.625263
2018-10-17 20:16:01 Train: [26/150][430500/2598450] Loss: 0.408760 Loss_CTC: 0.408760
time elapsed 1747
2018-10-17 20:19:37 Train: [26/150][431000/2598450] Loss: 0.353606 Loss_CTC: 0.353606
time elapsed 216
2018-10-17 20:23:46 Train: [26/150][431500/2598450] Loss: 0.396339 Loss_CTC: 0.396339
time elapsed 249
2018-10-17 20:30:23 Train: [26/150][432000/2598450] Loss: 0.379618 Loss_CTC: 0.379618
time elapsed 396
2018-10-17 20:35:08 Train: [27/150][432500/2598450] Loss: 0.395716 Loss_CTC: 0.395716
time elapsed 285
2018-10-17 20:40:20 Train: [27/150][433000/2598450] Loss: 0.430345 Loss_CTC: 0.430345
time elapsed 311
2018-10-17 20:45:49 Train: [27/150][433500/2598450] Loss: 0.451907 Loss_CTC: 0.451907
time elapsed 328
2018-10-17 20:49:50 Train: [27/150][434000/2598450] Loss: 0.393155 Loss_CTC: 0.393155
time elapsed 240
2018-10-17 20:54:10 Train: [27/150][434500/2598450] Loss: 0.375856 Loss_CTC: 0.375856
time elapsed 260
2018-10-17 20:59:41 Train: [27/150][435000/2598450] Loss: 0.374424 Loss_CTC: 0.374424
time elapsed 330
2018-10-17 21:24:42 Validation: max_iter: 1725, loss:1.233052, Validation cer: 0.065245, accuracyCTC: 0.623922
2018-10-17 21:29:18 Train: [27/150][435500/2598450] Loss: 0.400490 Loss_CTC: 0.400490
time elapsed 1777
2018-10-17 21:34:05 Train: [27/150][436000/2598450] Loss: 0.559940 Loss_CTC: 0.559940
time elapsed 286
2018-10-17 21:38:16 Train: [27/150][436500/2598450] Loss: 0.469486 Loss_CTC: 0.469486
time elapsed 250
2018-10-17 21:42:23 Train: [27/150][437000/2598450] Loss: 0.460522 Loss_CTC: 0.460522
time elapsed 246
2018-10-17 21:46:36 Train: [27/150][437500/2598450] Loss: 0.417255 Loss_CTC: 0.417255
time elapsed 252
2018-10-17 21:50:29 Train: [27/150][438000/2598450] Loss: 0.446802 Loss_CTC: 0.446802
time elapsed 233
2018-10-17 21:54:54 Train: [27/150][438500/2598450] Loss: 0.405387 Loss_CTC: 0.405387
time elapsed 264
2018-10-17 22:00:10 Train: [27/150][439000/2598450] Loss: 0.388070 Loss_CTC: 0.388070
time elapsed 316
2018-10-17 22:04:14 Train: [27/150][439500/2598450] Loss: 0.399813 Loss_CTC: 0.399813
time elapsed 243
2018-10-17 22:08:08 Train: [27/150][440000/2598450] Loss: 0.386462 Loss_CTC: 0.386462
time elapsed 233

Hi, I solved this problem by uploading the data to the temporary folder in Colab instead of reading it from Google Drive, which is extremely slow during the first epoch. Below is the difference between reading from the temporary folder (/content/) and from Google Drive.
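A rough sketch of that copy step, assuming the dataset is a plain directory tree. The helper name `stage_dataset` and the paths in the usage comment are hypothetical, so adjust them to your own setup:

```python
import shutil
from pathlib import Path


def stage_dataset(src_dir, dst_dir):
    """Copy `src_dir` to fast local storage once and return the local path.

    If `dst_dir` already exists it is reused as-is, so later runs skip
    the copy. The copy goes to a temporary sibling directory first, so an
    interrupted copy is never mistaken for a complete dataset.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"source dataset not found: {src}")
    if dst.exists():
        return dst

    tmp = dst.with_name(dst.name + ".partial")
    if tmp.exists():
        shutil.rmtree(tmp)  # leftover from an interrupted copy
    shutil.copytree(src, tmp)
    tmp.rename(dst)
    return dst


# Hypothetical usage on Colab (both paths are placeholders):
#
#   local_root = stage_dataset("/content/drive/MyDrive/my_dataset",
#                              "/content/my_dataset")
#   # then point the dataset / image list at `local_root`
```

The copy itself still has to read everything from the slow drive once, but it does so sequentially and up front, so the training loop only ever reads from local storage.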