Best way to cache a large pre-processed dataset on disk

After a long pre-processing procedure (~1h), I generate ~3.5 million samples to train a network on an image segmentation task. This dataset is too large to fit into memory, so I'd like to cache the pre-processed samples on disk. Saving one file per sample is very inefficient, taking over 50 GB of disk space.

What is the recommended/best way to solve this problem? Does PyTorch have something like TFRecord? Is there a ready-to-use library to cache large batches of samples?

All the solutions I've thought of basically involve coding something from scratch.


I'm not familiar with how TFRecords work. Have you thought of using numpy memmap? If the images are all the same size, you can create one big numpy array on disk, read from it on demand, and even build your own cache in RAM.
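A minimal sketch of the memmap idea, assuming all samples share a fixed shape (the file name, sample count, and dimensions below are placeholders, not from the original post):

```python
import numpy as np

# Illustrative sizes only; in practice N_SAMPLES would be ~3.5M.
N_SAMPLES, H, W, C = 100, 32, 32, 3

# Create a single memory-mapped file on disk to hold all samples.
samples = np.memmap("cache.dat", dtype=np.float32, mode="w+",
                    shape=(N_SAMPLES, H, W, C))

# Write pre-processed samples once (random data stands in here).
for i in range(N_SAMPLES):
    samples[i] = np.random.rand(H, W, C).astype(np.float32)
samples.flush()  # make sure everything is written to disk

# Later (e.g. in a Dataset's __getitem__), reopen read-only and
# pull individual samples on demand without loading the whole file.
cache = np.memmap("cache.dat", dtype=np.float32, mode="r",
                  shape=(N_SAMPLES, H, W, C))
sample = np.array(cache[42])  # copy one sample into RAM
```

Because indexing a memmap only touches the pages actually read, this gives you random access to any sample while the OS page cache handles buffering; a read like `cache[42]` maps naturally onto a PyTorch `Dataset.__getitem__`.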

Another option to speed up the loading pipeline could be NVIDIA DALI.