Prefetch_factor not working

I had an issue at work where the question was whether I should stream the data from an S3 bucket via the Dataset class, or download it first and simply read it from disk. I was hoping that increasing prefetch_factor in the DataLoader would speed up streaming from S3, and possibly even make it an alternative to downloading.

To stream the data instead of opening it from disk (as is done in many tutorials), the dataset class was set up as follows:

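from io import BytesIO

import requests
from PIL import Image
from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self, prefix, transform):
        self.prefix = prefix  # base URL of the bucket (blank in the post)
        self.transform = transform
    def __len__(self):
        return 999
    def __getitem__(self, i):
        # One blocking HTTP request per sample
        response = requests.get(self.prefix + f"/{i+1}.jpg")
        # Assumption: decode the response bytes with PIL (this line is incomplete in the post)
        img = Image.open(BytesIO(response.content))
        return self.transform(img)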

As shown in the experiments in this Kaggle kernel, the prefetch_factor flag did not speed things up in a meaningful way. The results are summarised below. For each run the following code snippet was executed, where model is simply resnet18.

with torch.inference_mode():
    for img_batch in tqdm(dl):
        out = model(img_batch)

| Settings | Time elapsed (mm:ss) |
|---|---|
| num_workers = 2 | 04:02 |
| num_workers = 2, prefetch_factor = 8 | 03:57 |
| num_workers = 8 | 01:01 |
| num_workers = 8, prefetch_factor = 8 | 01:01 |

All other parameters, such as batch_size=32 and pin_memory=True, were held constant across all runs.
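For reference, the four rows of the table can be written down as DataLoader keyword arguments. This is only a sketch: loader_kwargs and configs are hypothetical names, and only batch_size=32 and pin_memory=True are taken from the post. When prefetch_factor is left out and num_workers is greater than 0, DataLoader uses its default of 2.

```python
# Hypothetical helper: builds the keyword arguments for
# torch.utils.data.DataLoader for one row of the results table.
def loader_kwargs(num_workers, prefetch_factor=None):
    kwargs = {"batch_size": 32, "pin_memory": True, "num_workers": num_workers}
    # Only pass prefetch_factor when it was set explicitly, so that
    # DataLoader otherwise falls back to its own default.
    if prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs

configs = [
    loader_kwargs(2),
    loader_kwargs(2, prefetch_factor=8),
    loader_kwargs(8),
    loader_kwargs(8, prefetch_factor=8),
]

# Usage, assuming `ds` is an instance of the Data class above:
#   from torch.utils.data import DataLoader
#   dl = DataLoader(ds, **configs[1])
```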

Note that we started with 2 workers because that is the number returned by multiprocessing.cpu_count(). However, going past that number in the last two runs and setting it to 8 worked, although it produced the following ugly (repeated) warnings: Exception ignored in: Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__.

Any thoughts as to why prefetch_factor doesn’t work?

I’m not sure I understand why the prefetch_factor should speed up the data loading.
The docs explain this argument as:

Number of samples loaded in advance by each worker.

If your workers are not fast enough to preload data, increasing the prefetch_factor won’t help.
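A toy simulation may make this more concrete. It is not how DataLoader is implemented: it uses threads instead of worker processes, time.sleep instead of a network request, and a consumer that does no work at all. The names simulate and fetch_s are made up for the illustration. The point it tries to show is that the buffer size (num_workers * prefetch_factor) only limits how far ahead the workers may run, while the total time is set by how fast the workers can produce items.

```python
import queue
import threading
import time

def simulate(num_items, num_workers, prefetch_factor, fetch_s=0.01):
    """Return the seconds a consumer needs to receive num_items."""
    # Bounded buffer, standing in for the batches the workers may hold ready.
    buf = queue.Queue(maxsize=num_workers * prefetch_factor)
    todo = queue.Queue()
    for i in range(num_items):
        todo.put(i)

    def worker():
        while True:
            try:
                i = todo.get_nowait()
            except queue.Empty:
                return
            time.sleep(fetch_s)  # stands in for one slow network request
            buf.put(i)           # blocks only if the buffer is full

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for _ in range(num_items):
        buf.get()  # the consumer is never busy, so it always waits on the workers
    elapsed = time.perf_counter() - start
    for t in threads:
        t.join()
    return elapsed

results = {(w, p): simulate(40, w, p) for w, p in [(2, 2), (2, 8), (8, 2), (8, 8)]}
for (w, p), seconds in results.items():
    print(f"num_workers={w} prefetch_factor={p}: {seconds:.2f}s")
```

In this sketch the elapsed time cannot drop below num_items * fetch_s / num_workers, whatever the buffer size is, which mirrors the pattern in the table: more workers help, a deeper buffer does not.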

I see. I was hoping that it worked something like the TensorFlow prefetch, where the next batch is loaded while the GPU is busy.

I might be wrong, but shouldn’t the fact that I increased num_workers (even though mp.cpu_count() is 2) and sped it up mean that there would be room for improvement via prefetching?

The DataLoader is already prefetching the data (the default for the prefetch_factor is set to 2) and (as you’ve pointed out) increasing the number of workers seems to improve the loading speed.
You could profile the data loading in isolation to see how the “vertical/horizontal” scaling plays out in your system (i.e. increasing the parallel workload via num_workers vs. storing more batches sequentially via increasing the prefetch_factor). Based on your experiments the latter doesn’t help and I don’t think you can universally expect a speedup by increasing the prefetch_factor.
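One way to profile the data loading in isolation is to iterate over the loader without calling the model and time only that loop. The helper below is a minimal sketch; time_loader and max_batches are hypothetical names, and it accepts any iterable, so a DataLoader can be passed in directly.

```python
import time

def time_loader(loader, max_batches=None):
    """Iterate over `loader` without running a model; return (batches, seconds)."""
    start = time.perf_counter()
    n = 0
    for n, _batch in enumerate(loader, start=1):
        # Optionally stop early to keep each measurement short.
        if max_batches is not None and n >= max_batches:
            break
    return n, time.perf_counter() - start

# Usage, assuming `ds` is the streaming dataset from the question:
#   from torch.utils.data import DataLoader
#   for workers in (2, 4, 8):
#       for factor in (2, 8):
#           dl = DataLoader(ds, batch_size=32, num_workers=workers,
#                           prefetch_factor=factor)
#           print(workers, factor, time_loader(dl, max_batches=10))
```

Keep in mind that the measured time includes starting the worker processes, so very short runs will overstate the per-batch cost.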