I had an issue at work where the question was whether I should stream the data from an S3 bucket via the Dataset class, or download it first and simply read it from disk. I was hoping that increasing prefetch_factor in the DataLoader would increase the speed when streaming via S3, and possibly even make streaming a viable alternative to downloading.
In order to stream data instead of opening it from disk (as is done in many tutorials), the dataset class was set up as follows:
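The original dataset class isn't reproduced here; a minimal sketch of what an S3-streaming Dataset could look like is below. The bucket name, key list, and decode function are all hypothetical, and the boto3 client is created lazily because clients are not safely shared across forked worker processes:

```python
import io
from torch.utils.data import Dataset

class S3StreamDataset(Dataset):
    """Sketch of a Dataset that streams objects from S3 rather than local disk.

    `bucket`, `keys`, and `decode_fn` are assumptions for illustration, e.g.
    decode_fn could turn raw JPEG bytes into a tensor.
    """
    def __init__(self, bucket, keys, decode_fn):
        self.bucket = bucket
        self.keys = keys
        self.decode_fn = decode_fn
        self._client = None  # created lazily, once per worker process

    def _get_client(self):
        if self._client is None:
            # imported lazily so each DataLoader worker builds its own client
            import boto3
            self._client = boto3.client("s3")
        return self._client

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        obj = self._get_client().get_object(Bucket=self.bucket, Key=self.keys[idx])
        return self.decode_fn(io.BytesIO(obj["Body"].read()))
```

Each `__getitem__` call issues one network request, which is why prefetching and worker count matter more here than for local files.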
As shown in the experiments done in this Kaggle kernel, the prefetch_factor flag did not speed things up in a meaningful way. The results are summarised below. For each iteration the following code snippet was run, where model is simply a resnet18:
for img_batch in tqdm(dl):
    out = model(img_batch.to(device))
num_workers = 2
num_workers = 2, prefetch_factor=8
num_workers = 8
num_workers = 8, prefetch_factor=8
All other parameters, such as batch_size=32 and pin_memory=True, were held constant across all iterations.
Note that we used 2 workers because that is the number returned by multiprocessing.cpu_count(). However, going past that number in the last iterations and setting it to 8 still worked, though it produced the following ugly (repeated) warning: Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__.
Any thoughts as to why prefetch_factor doesn’t work?
The DataLoader is already prefetching data (the default prefetch_factor is 2) and, as you've pointed out, increasing the number of workers seems to improve the loading speed.
You could profile the data loading in isolation to see how the "vertical/horizontal" scaling plays out on your system (i.e. increasing the parallel workload via num_workers vs. storing more batches per worker via a larger prefetch_factor). Based on your experiments the latter doesn't help, and I don't think you can universally expect a speedup from increasing the prefetch_factor.
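Profiling the loading in isolation can be as simple as iterating the DataLoader without a model and timing a full pass. A sketch, using a synthetic TensorDataset as a stand-in for the real data (shapes and sizes are assumptions):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real dataset; shape/size are assumptions.
ds = TensorDataset(torch.randn(256, 3, 32, 32))

def time_loader(num_workers, prefetch_factor):
    """Iterate the full DataLoader once and return the elapsed seconds."""
    kwargs = dict(batch_size=32, num_workers=num_workers, pin_memory=False)
    if num_workers > 0:
        # prefetch_factor is only valid when workers are used
        kwargs["prefetch_factor"] = prefetch_factor
    dl = DataLoader(ds, **kwargs)
    start = time.perf_counter()
    for (batch,) in dl:
        pass
    return time.perf_counter() - start

# The four configurations from the experiments above
for nw, pf in [(2, 2), (2, 8), (8, 2), (8, 8)]:
    print(f"num_workers={nw}, prefetch_factor={pf}: {time_loader(nw, pf):.3f}s")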
Hi @ptrblck, I don’t understand the exact use of the prefetch factor argument.
According to the docs, it is the number of batches loaded in advance by each worker. My question is: since prefetching is a one-time thing done by the workers at the start of data loading, it might give a slight speedup for the initial batches, but in the long run the data loading by the workers would fail to catch up with the GPU, and there would be stalls (when there are no more elements in the data queue) until a worker fetches the next batch.
Therefore the num_workers argument makes sense: the more workers, the faster we load batches into the data queue for the GPU to consume. But prefetch_factor does not make sense to me.
That's not the case: each worker starts loading the next batch as soon as the number of batches waiting in the queue drops below the prefetch factor. This code snippet illustrates it.
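The snippet itself isn't reproduced in this thread; a minimal reconstruction consistent with the numbers quoted (batch_size=2, num_workers=8, default prefetch_factor=2) might look like this. The print statement in __getitem__ makes each worker's loading visible:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class VerboseDataset(Dataset):
    """Prints every index access so the prefetching behaviour is visible."""
    def __init__(self, n=100):
        self.data = torch.arange(n).float()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        print(f"loading index {idx}")
        return self.data[idx]

dl = DataLoader(VerboseDataset(), batch_size=2, num_workers=8, prefetch_factor=2)
it = iter(dl)   # workers start and prefetch 2 * 8 = 16 batches (32 samples)
x = next(it)    # returns one batch; a worker loads one more batch to refill
```

Watching the printed indices shows that loading is not a one-time burst at startup: every consumed batch triggers another fetch to keep the queue at the prefetch level.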
As you can see, creating the iterator directly loads prefetch_factor * num_workers = 2 * 8 = 16 batches, i.e. nb_batches * batch_size = 16 * 2 = 32 samples.
The next call then returns a batch and you can see that one of the workers directly starts to load the next batch (2 samples) to fill the queue.
Yes, this is the idea. So even if some training iterations happen to be faster (for whatever reason), the queue could still be filled, you should not see any slowdown due to data loading, and the workers will keep trying to refill the queue.