What data format should I choose?


I am writing my own pipeline, which sequentially generates data samples, each consisting of an np.array (3 dimensions, ~(100x100x15)) and a metadata dict (containing booleans, floats, integers, etc.). A single data sample with its metadata weighs ~600 KB, so 10k samples take up ~6 GB. I am OK with that when saving the pipeline's output, but during training I don't want to load all 6 GB every time.
Is there a good format that would let me save a file containing 6 GB of data, but read only parts of it on demand instead of loading it all at once?
I know of np.memmap, but it only works with pure tensors, not with the metadata. Since I don't want to separate the images from their metadata, this is not a good option for me.

Thanks in advance,

I use the hdf5 file format to store/load large datasets, together with h5py, the Python library for working with hdf5 files. It might be difficult to deal with at first, but it gets much better with experience.
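To illustrate the lazy-loading property that makes hdf5 a good fit here: a minimal h5py sketch (file name and dataset name are made up, and the shapes assume the ~(100x100x15) samples from the question). Opening the file does not read the arrays into memory; only the slice you index is fetched from disk.

```python
import numpy as np
import h5py

# Write: preallocate a dataset for 100 samples and fill it incrementally,
# as a sequential pipeline would.
with h5py.File("samples.h5", "w") as f:
    dset = f.create_dataset("images", shape=(100, 100, 100, 15), dtype="float32")
    dset[0] = np.random.rand(100, 100, 15).astype("float32")

# Read: indexing loads only the requested sample, not the whole file.
with h5py.File("samples.h5", "r") as f:
    img = f["images"][0]  # a single (100, 100, 15) array
```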

Thanks for the reply!
One question though: the docs say that "Groups work like dictionaries, and datasets work like NumPy arrays". In my case, each sample is a dictionary (with an image key and metadata keys). However, most examples I see store only integers/floats. Will hdf5 handle this well?

I’m not really sure whether it can store metadata, but what I generally do is create two NumPy arrays, one for the images and another for the labels. If I want the label for the i-th image, I simply do labels[i].
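The two-parallel-arrays approach above can be sketched with h5py as follows (file and dataset names are illustrative): images and labels live in separate datasets indexed by the same sample number, and indexing reads only that one entry from disk.

```python
import numpy as np
import h5py

# Write parallel datasets: images[i] and labels[i] describe the same sample.
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((10, 100, 100, 15), dtype="float32"))
    f.create_dataset("labels", data=np.arange(10, dtype="int64"))

# Read a single sample and its label without loading the full datasets.
with h5py.File("dataset.h5", "r") as f:
    i = 3
    image = f["images"][i]
    label = f["labels"][i]
```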

1 Like

For the sake of future readers: after learning about h5s, I think this is the right solution. There are also solutions for mixed dtypes; they’re called “tables” @hrushi
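Besides tables, one simple way to keep each sample's metadata dict next to its array, so they never get separated, is HDF5 attributes. A hedged sketch (group/dataset names and the metadata keys are invented for illustration); scalar booleans, ints, floats and strings all round-trip as attributes:

```python
import numpy as np
import h5py

meta = {"valid": True, "score": 0.93, "index": 7}  # example metadata dict

# Write: one group per sample, image as a dataset, metadata as attributes.
with h5py.File("with_meta.h5", "w") as f:
    g = f.create_group("sample_0000")
    d = g.create_dataset("image", data=np.zeros((100, 100, 15), dtype="float32"))
    for k, v in meta.items():
        d.attrs[k] = v

# Read: the attributes come back as a dict-like mapping alongside the array.
with h5py.File("with_meta.h5", "r") as f:
    d = f["sample_0000/image"]
    restored = dict(d.attrs)
```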

1 Like

I use these two modules to write/retrieve datasets to/from the hdf5 format. Maybe they will help you.