What data format should I choose?

Hello,

I am writing my own pipeline, which sequentially generates data samples consisting of an np.array (3 dimensions, ~100x100x15) and a metadata dict (containing booleans, floats, integers, etc.). A single data sample with its metadata weighs ~600 KB, so 10k samples take up 6 GB. I am okay with that while the pipeline saves its output, but during training of the model I don't want to load the whole 6 GB every time.
Is there a good format that would let me save files containing 6 GB of data, but read them without loading everything at once?
I know of np.memmap, but it only works with plain arrays, not with the metadata. Since I don't want to separate the images from their metadata, it is not a good option for me.

Thanks in advance,
Jonathan

I use the HDF5 file format to store/load large datasets, together with h5py, the Python library for working with HDF5 files. It might be difficult to deal with at the start, but it gets much better with experience.
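To give a rough idea of what that can look like, here is a minimal sketch (not your exact pipeline): each sample's array goes into its own dataset, and the metadata dict is stored as HDF5 attributes on that dataset. The file name, dataset names, and metadata keys are just placeholders.

```python
import h5py
import numpy as np

n = 100  # small number of samples just for the sketch

# --- writing ---
with h5py.File("samples.h5", "w") as f:
    for i in range(n):
        data = np.random.rand(100, 100, 15).astype(np.float32)  # stand-in for a real sample
        ds = f.create_dataset(f"sample_{i:05d}", data=data, compression="gzip")
        # scalar metadata (bools, floats, ints) fits naturally in attributes
        ds.attrs["label"] = 3
        ds.attrs["is_valid"] = True
        ds.attrs["exposure"] = 0.125

# --- reading one sample lazily ---
with h5py.File("samples.h5", "r") as f:
    ds = f["sample_00042"]
    image = ds[...]        # only this sample's data is read from disk
    meta = dict(ds.attrs)  # its metadata dict
```

The point is that opening the file is cheap; data is only read from disk when you index into a dataset, so you never have to hold all 6 GB in memory at once.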

Thanks for the reply!
One question though: the docs say that "Groups work like dictionaries, and datasets work like NumPy arrays". In my case, each sample is a dictionary (with an image key and a metadata key). However, most examples I see only have integers/floats in them. Will HDF5 handle this well?

I’m not really sure whether it can store metadata, but what I generally do is create two numpy arrays, one for the images and the other for the labels. If I want the label for the i-th image, I simply do label[i].
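Something like this rough sketch of the two-parallel-arrays approach (file and dataset names are just placeholders):

```python
import h5py
import numpy as np

n = 100  # small number of samples just for the sketch

with h5py.File("dataset.h5", "w") as f:
    images = f.create_dataset("images", shape=(n, 100, 100, 15), dtype="float32")
    labels = f.create_dataset("labels", shape=(n,), dtype="int64")
    for i in range(n):
        images[i] = np.random.rand(100, 100, 15)  # stand-in for a real sample
        labels[i] = i % 10

with h5py.File("dataset.h5", "r") as f:
    image = f["images"][42]  # reads only this one slice from disk
    label = f["labels"][42]  # the matching label, via the same index
```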


For the sake of future readers: after learning about HDF5, I think this is the right solution. There are also solutions for mixed dtypes, called "tables" @hrushi
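For anyone wanting a concrete starting point, here is a hedged sketch of a table-like (compound) dtype in h5py, where each row holds mixed metadata fields; the field names are invented for illustration:

```python
import h5py
import numpy as np

# one "row" of metadata per sample, with mixed field types
meta_dtype = np.dtype([
    ("label", np.int64),
    ("exposure", np.float32),
    ("is_valid", np.bool_),
])

with h5py.File("metadata.h5", "w") as f:
    meta = f.create_dataset("metadata", shape=(100,), dtype=meta_dtype)
    meta[0] = (3, 0.125, True)  # write one row

with h5py.File("metadata.h5", "r") as f:
    row = f["metadata"][0]  # reads only this row
    print(row["label"], row["exposure"], row["is_valid"])
```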


I use these two modules to write/retrieve datasets to/from the HDF5 format. Maybe they will help you.