Prefetch_factor not working

I had an issue at work where the question was whether I should stream the data from an S3 bucket via the Dataset class, or download it first and simply read it from disk. I was hoping that increasing prefetch_factor in the DataLoader would speed up streaming from S3, and possibly even make it a viable alternative to downloading.

In order to stream the data instead of opening it from disk (as is done in many tutorials), the Dataset class was set up as follows:

import requests
from io import BytesIO

from PIL import Image
from torch.utils.data import Dataset


class Data(Dataset):
    def __init__(self, prefix, transform):
        # Base URL of the public S3 bucket; the prefix argument is unused here
        self.prefix = "https://aft-vbi-pds.s3.amazonaws.com/bin-images"
        self.transform = transform

    def __len__(self):
        return 999

    def __getitem__(self, i):
        # Stream each image directly from S3 instead of reading it from disk
        response = requests.get(self.prefix + f"/{i+1}.jpg")
        img = Image.open(BytesIO(response.content))
        return self.transform(img)

As shown in the experiments done in this Kaggle kernel, the prefetch_factor flag did not speed things up in a meaningful way. The results are summarised below. For each iteration the following code snippet was run, where model is simply a resnet18.

with torch.inference_mode():
    for img_batch in tqdm(dl):
        out = model(img_batch.to(device))
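
For completeness, the model and device referenced above could be set up roughly as follows (only resnet18 is stated in the post; the rest is an assumption):

import torch
from torchvision.models import resnet18
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet18().to(device).eval()  # pretrained weights are irrelevant for a throughput test
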
Settings                              Time Elapsed
num_workers = 2                       04:02
num_workers = 2, prefetch_factor=8    03:57
num_workers = 8                       01:01
num_workers = 8, prefetch_factor=8    01:01

All other parameters, such as batch_size=32 and pin_memory=True, were held constant across all iterations.
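
For reference, the DataLoader in these runs could have been constructed roughly like this; the transform pipeline and variable names are assumptions, while batch_size=32, pin_memory=True and the varying num_workers/prefetch_factor values come from the experiments above:

from torch.utils.data import DataLoader
from torchvision import transforms

# Assumed preprocessing; the kernel's actual transform may differ
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

ds = Data(prefix=None, transform=transform)  # prefix is unused, the URL is hardcoded
dl = DataLoader(
    ds,
    batch_size=32,       # held constant
    pin_memory=True,     # held constant
    num_workers=2,       # varied: 2 or 8
    prefetch_factor=8,   # varied: default (2) or 8
)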

Note that the reason for using 2 workers was that this is the number reported by multiprocessing.cpu_count(). However, going past that number and setting num_workers=8 still worked, although it gave the following ugly (repeated) warnings: Exception ignored in: Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__.

Any thoughts as to why prefetch_factor doesn’t work?

I’m not sure I understand why the prefetch_factor should speed up the data loading.
The docs explain this argument as:

Number of samples loaded in advance by each worker.

If your workers are not fast enough to preload data, increasing the prefetch_factor won’t help.

I see, I was hoping that it worked something like TensorFlow's prefetch, where the next batch is loaded while the GPU is busy.

I might be wrong, but shouldn't the fact that increasing num_workers (even beyond mp.cpu_count(), which is 2) sped things up mean that there is room for improvement via prefetching?

The DataLoader is already prefetching the data (the default for the prefetch_factor is set to 2) and (as you’ve pointed out) increasing the number of workers seems to improve the loading speed.
You could profile the data loading in isolation to see how the “vertical/horizontal” scaling plays out in your system (i.e. increasing the parallel workload via num_workers vs. storing more batches sequentially via increasing the prefetch_factor). Based on your experiments the latter doesn’t help and I don’t think you can universally expect a speedup by increasing the prefetch_factor.
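
For example, a minimal way to time the loading in isolation (assuming the dl DataLoader from the question) could look like this:

import time

# Iterate the DataLoader without running the model to measure loading alone
start = time.perf_counter()
for img_batch in dl:
    pass
print(f"data loading only: {time.perf_counter() - start:.1f}s")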

Hi @ptrblck, I don’t understand the exact use of the prefetch factor argument.

According to the docs, it is the "Number of batches loaded in advance by each worker". My question is: since prefetching is a one-time thing done by the workers at the start of data loading, it might give a slight speedup for the initial batches, but in the long run the data loading by the workers would fail to catch up with the GPU speed and there would be stalls (when there are no more elements in the data queue) until a worker fetches the next batch.

Therefore, the num_workers argument makes sense, since the more workers there are, the faster we load batches into the data queue for the GPU to consume, but prefetch_factor does not make sense to me.

Could you please help? Thanks!

That's not the case, since each worker will start loading the next batch as soon as the number of batches waiting in the queue drops below the prefetch factor.
The code snippet below illustrates it.
As you can see, creating the iterator directly loads prefetch_factor * num_workers = 2 * 8 = 16 batches, i.e. nb_batches * batch_size = 16 * 2 = 32 samples.
The next call then returns a batch, and you can see that one of the workers directly starts loading the next batch (2 samples) to refill the queue.
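
The linked snippet isn't reproduced in this thread, but a minimal sketch along the same lines, assuming a toy MyDataset that prints every index it loads (batch_size=2, num_workers=8, prefetch_factor=2 are chosen to match the numbers above), could look like this:

import time

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(100).float().view(-1, 1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Print so that every sample load issued by a worker is visible
        print(f"loading index {index}")
        return self.data[index]


if __name__ == "__main__":
    dataset = MyDataset()
    loader = DataLoader(dataset, batch_size=2, num_workers=8, prefetch_factor=2)

    # Creating the iterator triggers the initial prefetch:
    # prefetch_factor * num_workers = 2 * 8 = 16 batches, i.e. 32 samples
    loader_iter = iter(loader)
    time.sleep(2)  # give the workers a moment to fill the queue

    # Consuming one batch frees a slot in the queue, so one worker
    # immediately loads the next batch (2 samples) to refill it
    data = next(loader_iter)
    time.sleep(2)

Running this, you should see 32 "loading index ..." prints right after the iterator is created, and two more after the first next call.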

OK, it makes sense now. So the main use of prefetch_factor is to ensure that we always have prefetch_factor * num_workers batches in the data queue for the GPU to consume, right?

Yes, this is the idea. Even if some training iterations happen to be faster (for some reason), the queue could still be filled, so you should not see any slowdown due to the data loading, and the workers will try to refill the queue afterwards.
