Best way to cache large pre-processed dataset on disk

After a long pre-processing procedure (~1h), I generate ~3.5 million samples to train a network on an image segmentation task. This dataset is too large to fit into memory, so I'd like to cache the pre-processed samples on disk. Saving one file per sample is very inefficient, taking over 50 GB of disk space.

What is the recommended/best way to solve this problem? Does PyTorch have something like TFRecord? Is there a ready-to-use library to cache large batches of samples?

All the solutions I've thought of basically involve coding something from scratch.

Thanks!

Hi,
I'm not familiar with how TFRecords work. Have you thought of using numpy memmap? If the images are all the same size, you can create one big numpy array on disk, read from it on demand, and even build your own cache in RAM.
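In case it helps, here's a rough sketch of what that could look like wrapped in a PyTorch Dataset. The file names, shapes, dtypes, and the `preprocess()` generator are all made up for illustration; substitute whatever your pre-processing actually produces.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Placeholder sizes/dtypes for illustration only.
N, C, H, W = 3_500_000, 3, 64, 64

# One-time write pass after pre-processing: allocate memory-mapped .npy files on disk
# and fill them sample by sample, so the full dataset never has to sit in RAM.
images = np.lib.format.open_memmap("images.npy", mode="w+", dtype=np.uint8, shape=(N, C, H, W))
masks = np.lib.format.open_memmap("masks.npy", mode="w+", dtype=np.uint8, shape=(N, H, W))
# for i, (img, mask) in enumerate(preprocess()):   # `preprocess()` is hypothetical
#     images[i], masks[i] = img, mask
images.flush(); masks.flush()


class MemmapSegmentationDataset(Dataset):
    """Reads samples lazily from the memory-mapped files."""

    def __init__(self, image_path="images.npy", mask_path="masks.npy"):
        # mmap_mode="r" maps the files read-only; pages are pulled in on access.
        self.images = np.load(image_path, mmap_mode="r")
        self.masks = np.load(mask_path, mmap_mode="r")

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # np.array() copies the slice so the returned tensors own writable memory.
        img = torch.from_numpy(np.array(self.images[idx], dtype=np.float32)) / 255.0
        mask = torch.from_numpy(np.array(self.masks[idx], dtype=np.int64))
        return img, mask
```

Since the OS page cache keeps recently touched pages in memory, repeated epochs over a dataset smaller than RAM end up mostly served from memory without any extra caching code.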

Another option to optimize the process could be using NVIDIA DALI: https://developer.nvidia.com/DALI
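A minimal sketch of a DALI pipeline feeding PyTorch, assuming the raw inputs are JPEG files laid out one folder per class under `data/train` (the directory layout, batch size, and resize op are placeholder choices, not something from the original post):

```python
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator


@pipeline_def
def image_pipeline(data_dir):
    # Read encoded files and labels, decode on GPU ("mixed"), and resize as an
    # example of doing the pre-processing inside the pipeline.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=128, resize_y=128)
    return images, labels


pipe = image_pipeline("data/train", batch_size=32, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["images"], batch[0]["labels"]
    # training step here
```

This trades the pre-computed cache for doing the (GPU-accelerated) pre-processing on the fly each epoch, so it mainly helps if the 1h pre-processing is dominated by decoding/augmentation rather than by work that truly has to be done once.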