I am writing my own pipeline, which sequentially generates data samples consisting of an np.array (3 dimensions, roughly 100x100x15) and a metadata dict (containing booleans, floats, integers, etc.). A single data sample with its metadata weighs ~600 KB, so 10k samples take up about 6 GB. I am okay with that while the pipeline is saving the data, but during model training I don't want to load all 6 GB at once every time.
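For concreteness, here is a minimal sketch of what one sample looks like (the shapes, dtype, and metadata keys are illustrative, not my exact pipeline):

```python
import numpy as np

# One sample as described above; values are dummies for illustration.
sample = {
    "image": np.random.rand(100, 100, 15).astype(np.float32),  # ~600 KB
    "metadata": {"is_valid": True, "exposure": 0.5, "frame_id": 42},
}
```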
Is there a good format that would let me save a file containing 6 GB of data, but read it without loading it all at once?
I know about np.memmap, but it only works with plain arrays, not with the metadata. Since I don't want to separate the images from their metadata, this is not a good option for me (see the sketch below).
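This is roughly how far np.memmap gets me (the file name and shape are hypothetical): the array stack is mapped lazily, but there is nowhere to put the per-sample dicts.

```python
import numpy as np

# Lazily maps a raw binary file of a fixed dtype/shape; assumes the
# file already exists. Metadata dicts cannot live in this format.
images = np.memmap("samples.dat", dtype=np.float32, mode="r",
                   shape=(10_000, 100, 100, 15))
batch = images[:32]  # only the touched pages are actually read from disk
```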
Thanks for the reply!
One question though: the docs say that "Groups work like dictionaries, and datasets work like NumPy arrays". In my case, the dataset should contain dictionaries (with image and metadata keys). However, most examples I see only have integers/floats in them. Will HDF5 handle this well?
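To make the question concrete, here is a sketch of the layout I have in mind, assuming h5py (file name, group names, and metadata keys are just for illustration): one group per sample, with the image as a dataset and the scalar metadata stored as group attributes.

```python
import h5py
import numpy as np

# Writing: one group per sample, image as a dataset, metadata as attrs.
with h5py.File("samples.h5", "w") as f:
    for i in range(10):  # a few dummy samples for illustration
        grp = f.create_group(f"sample_{i:05d}")
        grp.create_dataset(
            "image", data=np.random.rand(100, 100, 15).astype(np.float32)
        )
        grp.attrs["is_valid"] = True   # booleans, floats, and ints all
        grp.attrs["exposure"] = 0.5    # round-trip as HDF5 attributes
        grp.attrs["frame_id"] = i

# Reading stays lazy: only the requested sample is pulled from disk.
with h5py.File("samples.h5", "r") as f:
    grp = f["sample_00003"]
    image = grp["image"][...]   # loads just this one array
    metadata = dict(grp.attrs)  # back to a plain Python dict
```

Would attributes be the right place for this kind of per-sample metadata, or is there a better-suited layout?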