Long time spent loading data at the beginning of each epoch

Hi there, I am training a model whose input is videos. I use a large batch size (96), so many .jpg files have to be loaded (say 12,288 pictures), and 32 workers in the DataLoader. I noticed an interesting phenomenon: most of the time is spent at the beginning of each epoch (one epoch means a complete scan of the dataset), like this:

|Epoch: [6][1/11], lr: 0.00100|Time 81.513 (16.666)|Data 79.471 (14.021)|Loss 1.1498 (1.0695)|Prec@1 65.625 (72.222)|Prec@5 89.583 (91.257)|
|---|---|---|---|---|---|
|Epoch: [6][2/11], lr: 0.00100|Time 3.260 (15.635)|Data 1.233 (13.038)|Loss 0.8580 (1.0525)|Prec@1 77.083 (72.613)|Prec@5 90.625 (91.206)|
|Epoch: [6][3/11], lr: 0.00100|Time 2.791 (14.717)|Data 0.000 (12.106)|Loss 0.9509 (1.0449)|Prec@1 69.792 (72.403)|Prec@5 90.625 (91.163)|
|Epoch: [6][4/11], lr: 0.00100|Time 2.658 (13.913)|Data 0.000 (11.299)|Loss 1.3641 (1.0670)|Prec@1 63.542 (71.789)|Prec@5 87.500 (90.909)|
|Epoch: [6][5/11], lr: 0.00100|Time 2.674 (13.211)|Data 0.000 (10.593)|Loss 0.8397 (1.0523)|Prec@1 78.125 (72.200)|Prec@5 92.708 (91.026)|
|Epoch: [6][6/11], lr: 0.00100|Time 2.638 (12.589)|Data 0.000 (9.970)|Loss 0.7121 (1.0316)|Prec@1 82.292 (72.814)|Prec@5 95.833 (91.318)|
|Epoch: [6][7/11], lr: 0.00100|Time 2.647 (12.037)|Data 0.000 (9.416)|Loss 1.0643 (1.0335)|Prec@1 71.875 (72.760)|Prec@5 89.583 (91.219)|
|Epoch: [6][8/11], lr: 0.00100|Time 2.850 (11.553)|Data 0.000 (8.921)|Loss 0.8783 (1.0250)|Prec@1 77.083 (72.994)|Prec@5 92.708 (91.299)|
|Epoch: [6][9/11], lr: 0.00100|Time 2.766 (11.114)|Data 0.000 (8.475)|Loss 1.1610 (1.0320)|Prec@1 68.750 (72.776)|Prec@5 88.542 (91.158)|
|Epoch: [6][10/11], lr: 0.00100|Time 2.663 (10.711)|Data 0.000 (8.071)|Loss 0.9689 (1.0289)|Prec@1 75.000 (72.885)|Prec@5 92.708 (91.233)|
|Epoch: [6][11/11], lr: 0.00100|Time 1.609 (10.298)|Data 0.000 (7.704)|Loss 1.5306 (1.0395)|Prec@1 64.286 (72.705)|Prec@5 83.333 (91.068)|
|Epoch: [7][1/11], lr: 0.00100|Time 67.395 (12.780)|Data 65.243 (10.206)|Loss 1.1724 (1.0455)|Prec@1 66.667 (72.429)|Prec@5 87.500 (90.905)|
|Epoch: [7][2/11], lr: 0.00100|Time 19.035 (13.041)|Data 16.957 (10.487)|Loss 0.9826 (1.0428)|Prec@1 69.792 (72.313)|Prec@5 93.750 (91.029)|
|Epoch: [7][3/11], lr: 0.00100|Time 2.656 (12.625)|Data 0.000 (10.068)|Loss 0.9412 (1.0385)|Prec@1 76.042 (72.469)|Prec@5 93.750 (91.143)|
|Epoch: [7][4/11], lr: 0.00100|Time 2.664 (12.242)|Data 0.000 (9.680)|Loss 1.1080 (1.0413)|Prec@1 77.083 (72.655)|Prec@5 91.667 (91.164)|
|Epoch: [7][5/11], lr: 0.00100|Time 2.791 (11.892)|Data 0.000 (9.322)|Loss 0.9605 (1.0382)|Prec@1 75.000 (72.746)|Prec@5 96.875 (91.385)|
|Epoch: [7][6/11], lr: 0.00100|Time 2.657 (11.562)|Data 0.000 (8.989)|Loss 1.1846 (1.0437)|Prec@1 71.875 (72.713)|Prec@5 85.417 (91.163)|
|Epoch: [7][7/11], lr: 0.00100|Time 2.818 (11.261)|Data 0.000 (8.679)|Loss 0.9796 (1.0414)|Prec@1 69.792 (72.608)|Prec@5 93.750 (91.256)|
|Epoch: [7][8/11], lr: 0.00100|Time 2.671 (10.974)|Data 0.000 (8.390)|Loss 0.9922 (1.0397)|Prec@1 69.792 (72.511)|Prec@5 90.625 (91.234)|
|Epoch: [7][9/11], lr: 0.00100|Time 2.671 (10.707)|Data 0.000 (8.119)|Loss 1.1123 (1.0421)|Prec@1 66.667 (72.315)|Prec@5 93.750 (91.318)|
|Epoch: [7][10/11], lr: 0.00100|Time 2.674 (10.456)|Data 0.000 (7.865)|Loss 1.0186 (1.0413)|Prec@1 73.958 (72.368)|Prec@5 92.708 (91.363)|
|Epoch: [7][11/11], lr: 0.00100|Time 1.597 (10.187)|Data 0.000 (7.627)|Loss 1.2137 (1.0437)|Prec@1 66.667 (72.289)|Prec@5 90.476 (91.351)|
|Epoch: [8][1/11], lr: 0.00100|Time 94.021 (12.653)|Data 91.101 (10.082)|Loss 0.9634 (1.0412)|Prec@1 77.083 (72.437)|Prec@5 92.708 (91.393)|
|Epoch: [8][2/11], lr: 0.00100|Time 10.186 (12.582)|Data 8.128 (10.026)|Loss 0.9580 (1.0387)|Prec@1 78.125 (72.608)|Prec@5 89.583 (91.338)|
|Epoch: [8][3/11], lr: 0.00100|Time 2.751 (12.309)|Data 0.001 (9.748)|Loss 1.0534 (1.0392)|Prec@1 72.917 (72.617)|Prec@5 89.583 (91.287)|
|Epoch: [8][4/11], lr: 0.00100|Time 2.665 (12.049)|Data 0.000 (9.484)|Loss 1.1317 (1.0418)|Prec@1 66.667 (72.448)|Prec@5 92.708 (91.327)|
|Epoch: [8][5/11], lr: 0.00100|Time 2.657 (11.801)|Data 0.000 (9.235)|Loss 1.0215 (1.0412)|Prec@1 70.833 (72.404)|Prec@5 94.792 (91.423)|
|Epoch: [8][6/11], lr: 0.00100|Time 2.814 (11.571)|Data 0.000 (8.998)|Loss 1.2059 (1.0457)|Prec@1 70.833 (72.362)|Prec@5 91.667 (91.429)|
|Epoch: [8][7/11], lr: 0.00100|Time 2.774 (11.351)|Data 0.001 (8.773)|Loss 0.9423 (1.0430)|Prec@1 72.917 (72.376)|Prec@5 90.625 (91.408)|
|Epoch: [8][8/11], lr: 0.00100|Time 2.672 (11.139)|Data 0.000 (8.559)|Loss 0.9794 (1.0413)|Prec@1 66.667 (72.231)|Prec@5 90.625 (91.388)|
|Epoch: [8][9/11], lr: 0.00100|Time 2.666 (10.938)|Data 0.000 (8.355)|Loss 0.9243 (1.0384)|Prec@1 73.958 (72.274)|Prec@5 89.583 (91.344)|
|Epoch: [8][10/11], lr: 0.00100|Time 2.671 (10.745)|Data 0.000 (8.161)|Loss 1.2551 (1.0437)|Prec@1 72.917 (72.289)|Prec@5 90.625 (91.326)|
|Epoch: [8][11/11], lr: 0.00100|Time 1.616 (10.538)|Data 0.000 (7.976)|Loss 0.8046 (1.0412)|Prec@1 66.667 (72.231)|Prec@5 100.000 (91.417)|

As you can see, the training step itself takes marginal time, while most of the time is spent at the beginning of each epoch.

Is this an expected phenomenon?

You are using a lot of workers; it takes quite a while to bring up 32 worker processes. How many cores does your CPU have? Can you try lowering it to, say, 4 workers and see if the issue persists?
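As a quick sanity check, something like the sketch below prints the visible core count and tries a smaller worker pool (`train_dataset` and the batch size are placeholders for your own setup, not code from this thread):

import os
from torch.utils.data import DataLoader

print("CPU cores visible to Python:", os.cpu_count())

# `train_dataset` is a placeholder for your own Dataset.
train_loader = DataLoader(train_dataset, batch_size=96, shuffle=True, num_workers=4)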

I have 32 CPU cores on the server. I tried num_workers = 4, 8, 16, 32, 48, 96, but in every case the data loading time at the start of an epoch is still 80+ seconds.

In that case, can you post your data loading function? If you call a heavy init() at the beginning of every epoch, that could also be the bottleneck (but I can't tell without seeing how your code is defined).
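For context, a video-frame dataset typically looks roughly like the sketch below (all names here are hypothetical, not the poster's actual code). The point is that everything in __init__ runs once, while anything expensive that is re-run per epoch or per __getitem__ call shows up as Data time in the log:

import os
import torch
from PIL import Image
from torch.utils.data import Dataset

class VideoFrameDataset(Dataset):
    """Hypothetical sketch: one sample = a fixed number of .jpg frames from one clip."""

    def __init__(self, root, frames_per_clip=128, transform=None):
        # Runs once, at construction time -- keep heavy setup here,
        # not in anything that is re-executed every epoch.
        self.clip_dirs = sorted(os.path.join(root, d) for d in os.listdir(root))
        self.frames_per_clip = frames_per_clip
        self.transform = transform  # assumed to return tensors, e.g. transforms.ToTensor()

    def __len__(self):
        return len(self.clip_dirs)

    def __getitem__(self, idx):
        # Runs in a worker process for every sample: this is the part that
        # hits the disk and decodes the JPEGs.
        clip_dir = self.clip_dirs[idx]
        frame_names = sorted(os.listdir(clip_dir))[: self.frames_per_clip]
        frames = []
        for name in frame_names:
            img = Image.open(os.path.join(clip_dir, name)).convert("RGB")
            frames.append(self.transform(img))
        return torch.stack(frames), 0  # label is a placeholder here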

I recommend running cProfile to really understand what the issue is.
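For example, profiling just the first (slow) batch fetch usually points straight at the culprit. A minimal sketch, where `train_loader` stands in for your DataLoader:

import cProfile
import pstats

# Note: with num_workers > 0 the disk/decode work happens in worker
# subprocesses, so the main-process profile mostly shows waiting; it can
# help to profile once with num_workers=0 to see where the time really goes.
profiler = cProfile.Profile()
profiler.enable()
first_batch = next(iter(train_loader))  # the call that takes ~80 s
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)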

Is your data stored on an SSD or an HDD? Are you preloading all samples, or loading them just in time when you need them? How large is one image?
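If you are not sure about the last point, a quick check on a single frame gives a rough idea of file size and decode cost (a sketch; the path below is hypothetical, substitute one of your own .jpg frames):

import os
import time
from PIL import Image

path = "frames/video_0001/frame_0001.jpg"  # hypothetical path
print("file size (KB):", os.path.getsize(path) / 1024)

start = time.perf_counter()
Image.open(path).convert("RGB")
print("read + decode time (ms):", (time.perf_counter() - start) * 1e3)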

What does the 1/11 signify?

How many iterations through your training loop does it take to exhaust the data loader?

Keep in mind that the data loader is going to assign an entire batch to each worker, and with 32 workers you’re going to be hammering the disk pretty hard. Once you have the first batch ready to go, the next 31 should also finish right around the same time. That means that batch two (through 32) should be ready with zero latency, and the first loader process should be working on the 33rd, which is why everything but the first batch is quick.
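You can see this pattern directly by timing how long each next() call waits on a fresh iterator. A minimal sketch, with `train_dataset` standing in for your actual dataset:

import time
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=96, num_workers=32)

it = iter(loader)  # workers start up and begin loading here
for i in range(len(loader)):
    start = time.perf_counter()
    batch = next(it)
    print(f"batch {i + 1}: waited {time.perf_counter() - start:.2f}s")
# Expected: a long wait for batch 1 while every worker reads from disk,
# then near-zero waits for the batches the other workers already prepared.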

You can try having two data loaders active at the same time, so that when you get to epoch 2, the second DL already has everything loaded up and ready to go. Something like:

this_dl = DataLoader(..., num_workers=16)
next_dl = DataLoader(..., num_workers=16)
this_dl_iter = iter(this_dl)   # starts this loader's workers now
next_dl_iter = iter(next_dl)   # second loader prefetches in the background

for epoch in range(epochs):
    for batch in this_dl_iter:
        ...  # training, etc.

    # At the end of the epoch, swap loaders: the "next" loader has been
    # prefetching all along, so the first batch of the next epoch is ready.
    this_dl_iter = next_dl_iter
    this_dl, next_dl = next_dl, this_dl
    next_dl_iter = iter(next_dl)

Note: I’m assuming without checking that trying to iterate over the same DL twice at the same time won’t work. Double check that. :slight_smile: