What data format should I choose?


I am writing my own pipeline, which sequentially generates data samples, each consisting of an np.array (3 dimensions, ~(100x100x15)) and a metadata dict (containing booleans, floats, integers, etc.). A single data sample with its metadata weighs ~600 KB, so 10k samples take up ~6 GB. I am OK with that when saving the pipeline's output, but during training I don't want to load all 6 GB every time.
Is there a good format that would let me save a file containing 6 GB of data, but read only parts of it on demand instead of loading it all at once?
I know of np.memmap, but it only works with pure tensors, not with the metadata. Since I don't want to separate the images from their metadata, this is not a good option for me.

Thanks in advance,

I use the hdf5 file format to store/load large datasets, together with h5py, the Python library for working with hdf5 files. It might be difficult to deal with at first, but it gets much better with experience.
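To illustrate the lazy-loading property that makes hdf5 a good fit here: a minimal h5py sketch (file name and dataset name are made up, and the shapes assume the ~(100x100x15) samples from the question). Opening the file does not read the arrays into memory; only the slice you index is fetched from disk.

```python
import numpy as np
import h5py

# Write: preallocate a dataset for 100 samples and fill it incrementally,
# as a sequential pipeline would.
with h5py.File("samples.h5", "w") as f:
    dset = f.create_dataset("images", shape=(100, 100, 100, 15), dtype="float32")
    dset[0] = np.random.rand(100, 100, 15).astype("float32")

# Read: indexing loads only the requested sample, not the whole file.
with h5py.File("samples.h5", "r") as f:
    img = f["images"][0]  # a single (100, 100, 15) array
```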

Thanks for the reply!
One question though: the docs say that "Groups work like dictionaries, and datasets work like NumPy arrays". In my case, each sample is a dictionary (with an image key and metadata keys). However, most examples I see store only integers/floats. Will hdf5 handle this well?

I’m not really sure whether it can store metadata, but what I generally do is create two NumPy arrays, one for the images and another for the labels. If I want the label for the i-th image, I simply do labels[i].
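The two-parallel-arrays approach above can be sketched with h5py as follows (file and dataset names are illustrative): images and labels live in separate datasets indexed by the same sample number, and indexing reads only that one entry from disk.

```python
import numpy as np
import h5py

# Write parallel datasets: images[i] and labels[i] describe the same sample.
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((10, 100, 100, 15), dtype="float32"))
    f.create_dataset("labels", data=np.arange(10, dtype="int64"))

# Read a single sample and its label without loading the full datasets.
with h5py.File("dataset.h5", "r") as f:
    i = 3
    image = f["images"][i]
    label = f["labels"][i]
```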

1 Like

For the sake of future readers: after learning about h5s, I think this is the right solution. There are also solutions for mixed dtypes; they’re called “tables” @hrushi
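Besides tables, one simple way to keep each sample's metadata dict next to its array, so they never get separated, is HDF5 attributes. A hedged sketch (group/dataset names and the metadata keys are invented for illustration); scalar booleans, ints, floats and strings all round-trip as attributes:

```python
import numpy as np
import h5py

meta = {"valid": True, "score": 0.93, "index": 7}  # example metadata dict

# Write: one group per sample, image as a dataset, metadata as attributes.
with h5py.File("with_meta.h5", "w") as f:
    g = f.create_group("sample_0000")
    d = g.create_dataset("image", data=np.zeros((100, 100, 15), dtype="float32"))
    for k, v in meta.items():
        d.attrs[k] = v

# Read: the attributes come back as a dict-like mapping alongside the array.
with h5py.File("with_meta.h5", "r") as f:
    d = f["sample_0000/image"]
    restored = dict(d.attrs)
```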

1 Like

I use these two modules to write/retrieve datasets to/from the hdf5 format. Maybe they will help you.