Save torch tensors as hdf5


(Nabila Abraham) #1

Hi guys!

I’m not sure if this is a PyTorch question, but I want to save the second-to-last fc outputs of a pretrained VGG into an HDF5 array to load later on. The issue is that I would need to save all the tensor outputs as one chunk to use an HDF5 dataset (below), but I can’t seem to append tensors to an h5py dataset without creating chunks. Does anyone know an efficient way to save torch tensors as one chunk in an HDF5 file?

Any help appreciated! :slight_smile:

# https://www.tinymind.com/learn/terms/hdf5

import h5py
import torch
import torch.utils.data as data

class H5Dataset(data.Dataset):

    def __init__(self, file_path):
        super(H5Dataset, self).__init__()
        h5_file = h5py.File(file_path, 'r')  # open read-only
        self.data = h5_file.get('data')
        self.target = h5_file.get('label')

    def __getitem__(self, index):            
        return (torch.from_numpy(self.data[index,:,:,:]).float(),
                torch.from_numpy(self.target[index,:,:,:]).float())

    def __len__(self):
        return self.data.shape[0]
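For reference, here is a minimal sketch of the file layout the class above expects (hypothetical file name `toy.h5`, toy shapes): one `data` and one `label` dataset, each indexed along the first axis. The slicing below is exactly what `__getitem__` performs before `torch.from_numpy` wraps the result:

```python
import h5py
import numpy as np

# Build a toy file with the (N, C, H, W) layout H5Dataset expects.
with h5py.File('toy.h5', 'w') as f:
    f.create_dataset('data', data=np.zeros((4, 3, 8, 8), dtype=np.float32))
    f.create_dataset('label', data=np.ones((4, 1, 8, 8), dtype=np.float32))

# Index one sample, the same way __getitem__ does.
with h5py.File('toy.h5', 'r') as f:
    sample = f['data'][0]
    target = f['label'][0]

print(sample.shape, target.shape)  # (3, 8, 8) (1, 8, 8)
```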

(Imanol Luengo) #2

HDF5 is not a great format for appending information over time; it ends up generating a very large binary file to accommodate the new data.

I’d recommend doing it for a fixed size. E.g. first create a dataset of a fixed size:

import numpy as np

N = 100  # the length of my dataset
data = h5_file.create_dataset('data', shape=(N, 3, 224, 224), dtype=np.float32, fillvalue=0)

Then populate it:

for i in range(N):
    img = ... # load image
    data[i] = img

h5_file.close()
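Applied to the original question, the fc2 features could be written the same way. Below is a minimal sketch using random vectors as stand-ins for the real `model(x).detach().cpu().numpy()` outputs (the file name `features.h5` and the 4096-dim fc2 width are assumptions, not part of the thread):

```python
import h5py
import numpy as np

N, FEAT = 10, 4096  # hypothetical dataset size and fc2 width

# Pre-allocate the full dataset, then fill it row by row.
with h5py.File('features.h5', 'w') as f:
    feats = f.create_dataset('data', shape=(N, FEAT), dtype=np.float32, fillvalue=0)
    for i in range(N):
        vec = np.random.rand(FEAT).astype(np.float32)  # stand-in for a feature vector
        feats[i] = vec

# Verify everything landed in one pre-sized dataset.
with h5py.File('features.h5', 'r') as f:
    stored = f['data'][:]

print(stored.shape)  # (10, 4096)
```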

However, if you really really want resizable datasets (not recommended; the file size can grow very quickly), HDF5 and h5py support them, e.g. replacing the above:

data = h5_file.create_dataset('data', shape=(N, 3, 224, 224), dtype=np.float32, 
                              maxshape=(None, 3, 224, 224))

And then at any time you can call the resize function of a dataset:

data.resize(10000, axis=0) # now you can fit up to 10K samples!

Just be careful not calling resize too many times :slight_smile:
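Putting the resizable approach together, a minimal sketch (hypothetical file name, toy shapes) that grows the dataset one batch at a time with `resize` plus slice assignment:

```python
import h5py
import numpy as np

with h5py.File('growable.h5', 'w') as f:
    # Start empty along axis 0; maxshape=(None, ...) makes that axis resizable.
    ds = f.create_dataset('data', shape=(0, 3, 4, 4), dtype=np.float32,
                          maxshape=(None, 3, 4, 4))
    for _ in range(3):
        new = np.ones((2, 3, 4, 4), dtype=np.float32)  # stand-in for a batch
        ds.resize(ds.shape[0] + new.shape[0], axis=0)  # grow along axis 0
        ds[-new.shape[0]:] = new                       # write into the new rows
    final_shape = ds.shape

print(final_shape)  # (6, 3, 4, 4)
```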

PS: Make sure you open the HDF5 file read-only for better performance, and add the swmr flag to allow concurrent reads; for that to work, the .h5 file must also have been created with SWMR enabled:

h5_file = h5py.File(file_path, 'r', libver='latest', swmr=True)
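For completeness, a minimal sketch of the full SWMR round trip (hypothetical file name). The writer must create the file with `libver='latest'`, create its datasets, and then enable `swmr_mode` before readers attach; `flush()` makes writes visible to concurrent readers:

```python
import h5py
import numpy as np

# Writer side: create the file with SWMR support.
with h5py.File('swmr.h5', 'w', libver='latest') as f:
    ds = f.create_dataset('data', data=np.arange(10))
    f.swmr_mode = True  # from now on, readers may attach concurrently
    ds[0] = 99
    ds.flush()          # push the write out so readers can see it

# Reader side (could be another process): open read-only with swmr=True.
with h5py.File('swmr.h5', 'r', libver='latest', swmr=True) as f:
    first = int(f['data'][0])

print(first)  # 99
```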

(Nabila Abraham) #3

@imaluengo this is awesome, thanks so much! I didn’t know I could just keep the H5 file open and add to it, because when I tried appending fc2 output tensors to a Python list, I ran out of memory. I guess the h5 file compresses the data when it stores it in the dataset? I’ll read into this some more, but thank you again :slight_smile:


(Imanol Luengo) #4

Yup, appending doesn’t work great in HDF5.

Resize + inserting data with slices should work fine, though. Just be aware that a resize operation does not physically extend the dataset’s storage on disk; it creates a separate storage blob and links them together under the hood.

Thus, the more you resize, the slower consecutive reads become, e.g.:

data = h5_file['data'][100:200]

That returns 100 elements; however, if the first 50 elements were created initially and the last 50 were added later as part of a resize, that retrieval will be slower, because the data is not contiguous on disk.

So, just try to minimize resize calls :slight_smile:
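One way to keep frequently resized datasets read-friendly is to pick an explicit chunk shape when creating them (resizable datasets are always chunked, and aligning chunks with your access pattern helps slice reads). A minimal sketch with hypothetical shapes; tune `chunks` to how you actually read:

```python
import h5py
import numpy as np

with h5py.File('chunked.h5', 'w') as f:
    # Explicit 10-row chunks; a 10-row slice read then touches one chunk.
    ds = f.create_dataset('data', shape=(100, 8), dtype=np.float32,
                          maxshape=(None, 8), chunks=(10, 8))
    ds[:] = np.ones((100, 8), dtype=np.float32)
    block = ds[50:60]  # slice read aligned to a single chunk

print(block.shape)  # (10, 8)
```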