Best way to cache a large pre-processed dataset on disk

After a long pre-processing procedure (~1h), I generate ~3.5 million samples to train a network on an image segmentation task. This dataset is too large to fit into memory, so I'd like to cache the pre-processed samples on disk. Saving one file per sample is very inefficient, taking over 50 GB of disk space.

What is the recommended/best way to solve this problem? Does PyTorch have something like TFRecord? Is there a ready-to-use library to cache large batches of samples?

All the solutions I've thought of basically involve coding something from scratch.


I'm not familiar with how TFRecords work. Have you thought of using numpy memmap? If the images are all the same size, you can create one big numpy array on disk, read from it on demand, and even build your own cache in RAM.
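A minimal sketch of the memmap idea, assuming all samples share a fixed shape (the file name, sample count, and dimensions below are placeholders, not from the original post):

```python
import numpy as np

# Illustrative sizes only; in practice N_SAMPLES would be ~3.5M.
N_SAMPLES, H, W, C = 100, 32, 32, 3

# Create a single memory-mapped file on disk to hold all samples.
samples = np.memmap("cache.dat", dtype=np.float32, mode="w+",
                    shape=(N_SAMPLES, H, W, C))

# Write pre-processed samples once (random data stands in here).
for i in range(N_SAMPLES):
    samples[i] = np.random.rand(H, W, C).astype(np.float32)
samples.flush()  # make sure everything is written to disk

# Later (e.g. in a Dataset's __getitem__), reopen read-only and
# pull individual samples on demand without loading the whole file.
cache = np.memmap("cache.dat", dtype=np.float32, mode="r",
                  shape=(N_SAMPLES, H, W, C))
sample = np.array(cache[42])  # copy one sample into RAM
```

Because indexing a memmap only touches the pages actually read, this gives you random access to any sample while the OS page cache handles buffering; a read like `cache[42]` maps naturally onto a PyTorch `Dataset.__getitem__`.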

Another option to speed up the loading pipeline could be NVIDIA DALI.