Eliminate data loading pause between train and val loops

I’m wondering if anyone has tried to do this. When shifting from train to val or vice versa, there is a pause in training as the active DataLoader changes and begins filling up its queue of batches (I’m talking particularly about using torch.utils.data.DataLoader with num_workers > 1). When using large batches, this pause can be significant. Is there some way to have the next DataLoader start queuing batches as soon as the previous one has finished buffering, even if the model hasn’t processed all of those batches yet?

So the DataLoader workers are started when the iterator is created, not when the DataLoader itself is constructed. You could look into creating the next iterator right after loading the last batch from the current one, so the next set of workers starts prefetching while you finish processing.
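A minimal sketch of that idea, using toy `TensorDataset`s as stand-ins for real data (all dataset and variable names here are illustrative): the val iterator is created while the last train batch is still being processed, so its workers begin filling their queue early.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for real train/val datasets.
train_ds = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
val_ds = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))

train_loader = DataLoader(train_ds, batch_size=8, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=8, num_workers=2)

seen_train = seen_val = 0
num_batches = len(train_loader)
val_iter = None

for i, (x, y) in enumerate(train_loader):
    seen_train += x.shape[0]
    # ... training step on (x, y) ...
    if i == num_batches - 1 and val_iter is None:
        # Creating the iterator here spawns the val workers, so they
        # prefetch batches while the last train batch is still in flight.
        val_iter = iter(val_loader)

for x, y in val_iter:
    seen_val += x.shape[0]
    # ... validation step ...
```

The same trick works in the other direction (creating the next epoch’s train iterator during the last validation batch), at the cost of having more worker processes alive at once.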

The other question is whether it is worth the complexity you’d be introducing. As you note, this only affects the first batch of each loop, so if you have a sizeable number of batches, it contributes at most 1/(number of batches) of the total run time. You typically get better leverage from considerably smaller optimizations that affect every batch (where and when you do augmentation or other processing, or where and how you store your data).
I’m saying this because in my experience it is very easy to be misled by intuition when optimizing overall runtime. This pause is very visible during training (the progress bars aren’t ticking and all), but what kind of speedup of the overall time can you actually get from eliminating it?
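One way to answer that question is to measure it rather than guess. A hypothetical helper like the one below (the function name is made up for illustration; it works on any iterable, including a DataLoader iterator) separates the time spent waiting for the first batch from the time spent on the rest of the epoch:

```python
import time

def time_epoch(loader):
    """Return (first_batch_wait, rest_of_epoch_time, num_batches).

    The first-batch wait approximates the startup pause you would
    eliminate; compare it to the total to estimate the real speedup.
    """
    it = iter(loader)
    t0 = time.perf_counter()
    next(it)  # blocks until the workers deliver the first batch
    first_batch_wait = time.perf_counter() - t0

    t1 = time.perf_counter()
    n = 1
    for _ in it:
        n += 1
    rest_time = time.perf_counter() - t1
    return first_batch_wait, rest_time, n
```

If the first-batch wait is, say, 2 s out of a 200 s epoch, the best case from hiding it is a 1% speedup, which puts the added complexity into perspective.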

Best regards

Thomas

Thanks Tom, for the suggestion and also for the good point about optimization.