I see that yolov7 has a default option to cache all images in memory for faster training.
However, based on this thread, torch.utils.data.dataloader.DataLoader does prefetching by default? If so, wouldn't the effect of caching images on training speed be minimal? Or am I misunderstanding how the prefetching works?
It would depend on the data loading speed and whether you see any bottlenecks or general wait times during training when the DataLoader is used. If so, creating a cache might give you benefits, assuming the cached images avoid the bottleneck in the first place.
However, you would need to check whether the cache uses your system RAM (and whether you would have enough) or the disk, as the latter might not give you huge benefits.
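One rough way to sanity-check the RAM question is a back-of-the-envelope estimate of the cache size against physical memory. The image count and resolution below are made-up placeholders; substitute your own dataset's numbers:

```python
import os

# Hypothetical dataset parameters -- replace with your own.
n_images = 120_000
img_size = 640

# uint8 RGB image: H * W * C bytes each.
bytes_per_image = img_size * img_size * 3
cache_bytes = n_images * bytes_per_image

# Total physical RAM via sysconf (works on Linux/macOS).
ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

print(f"cache would need ~{cache_bytes / 1e9:.1f} GB, "
      f"system has ~{ram_bytes / 1e9:.1f} GB")
```

If the estimate is anywhere near (or above) your total RAM, the in-memory cache option is likely off the table.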
I see. Thanks, Patrick! Yeah, I see a slowdown without the cache, but my whole dataset is too large to fit into memory. So I was wondering if I could just prefetch and cache only the next batch. But it seems the DataLoader already does that?
Yes, the DataLoader will prefetch the next batches, but will not cache them.
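A minimal sketch of the prefetching behavior, using a toy stand-in dataset (the class and sizes are made up for illustration): with `num_workers > 0`, each worker loads up to `prefetch_factor` batches ahead of the training loop (the default is 2), while with `num_workers=0` no prefetching happens at all.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeImages(Dataset):
    # Stand-in for a real image dataset; __getitem__ would normally
    # read and decode an image from disk.
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(3, 64, 64)

# Each of the 2 workers prefetches up to 2 batches in the background.
loader = DataLoader(FakeImages(), batch_size=8,
                    num_workers=2, prefetch_factor=2)

for batch in loader:
    pass  # batches were prepared ahead of time by the worker processes
```

The prefetched batches are consumed and discarded; requesting the same sample in the next epoch triggers a fresh load, which is the difference from a cache.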
I don’t know if your Yolo code could cache only a subset of the dataset to avoid running out of memory, but a quick check of the linked code seems to show all samples would be added to the cache.
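If the Yolo code doesn't support it directly, a partial cache could in principle be written into the Dataset itself. A minimal sketch, assuming a hypothetical `_load` method standing in for the real disk I/O (e.g. the image reading in the linked code):

```python
import torch
from torch.utils.data import Dataset

class PartiallyCachedDataset(Dataset):
    """Cache only the first `cache_n` samples in RAM; load the rest
    from disk on demand. All names here are illustrative."""

    def __init__(self, n_samples, cache_n):
        self.n_samples = n_samples
        # Eagerly load a bounded subset into memory.
        self.cache = {i: self._load(i)
                      for i in range(min(cache_n, n_samples))}

    def _load(self, idx):
        # Placeholder for real disk I/O (e.g. reading and decoding an image).
        return torch.full((3, 4, 4), float(idx))

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        if idx in self.cache:
            return self.cache[idx]  # fast path: already in RAM
        return self._load(idx)      # slow path: read from disk
```

This only helps if the sampler revisits the cached indices often enough; with full random shuffling the benefit scales with the fraction of the dataset you can afford to keep in memory.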