Save torch tensors as hdf5

imaluengo · March 12, 2019, 7:53am

HDF5 is not a great format for appending information over-time… It will end up generating a very large binary file to handle new data.

I’d recommend doing it for a fixed size. E.g. first create a dataset of a fixed size:

N = 100 # find the length of my dataset
data = h5_file.create_dataset('data', shape=(N, 3, 224, 224), dtype=np.float32, fillvalue=0)

Then populate it

for i in range(N):
    img = ... # load image
    data[i] = img

h5_file.close()

However, if you really really want to do resizable datasets (not recommended, size will exponentially grow), HDF5 and h5py supports it, e.g., replacing above:

data = h5_file.create_dataset('data', shape=(N, 3, 224, 224), dtype=np.float32, 
                              maxshape=(None, 3, 224, 224))

And then at any time you can call the resize function of a dataset:

data.resize(10000, axis=0) # now you can fit up to 10K samples!

Just be careful not calling resize too many times

PS: Make sure you open the HDF5 file in read-only for better performance, and add swmr flag to allow concurrent reads, for which the .h5 file has had to be created with swmr too:

h5_file = h5.File(file_path, 'r', libver='latest', swmr=True)