Efficiently saving and loading data using h5py (or other methods)

I am testing ways to efficiently save and retrieve data using h5py, but I am having trouble keeping the running time down without using up all my memory.

In my first method I simply create a static HDF5 file:

with h5py.File(fileName, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32')
    f.create_dataset('data_y', data=y, dtype='float32')

In the second method, I set the maxshape parameter so that I can append more training data in the future (see https://stackoverflow.com/questions/47072859/how-to-append-data-to-one-specific-dataset-in-a-hdf5-file-with-h5py):

with h5py.File(fileName2, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32', maxshape=(None, 4919))
    f.create_dataset('data_y', data=y, dtype='float32', maxshape=(None, 6))
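For context, appending to these resizable datasets would later be done by resizing along the first axis and writing the new rows. A minimal sketch of what I mean (X_new and y_new are placeholder arrays, not part of my actual code):

with h5py.File(fileName2, 'a') as f:
    # placeholder new data; grow both datasets along the first axis
    n_new = X_new.shape[0]
    f['data_X'].resize(f['data_X'].shape[0] + n_new, axis=0)
    f['data_y'].resize(f['data_y'].shape[0] + n_new, axis=0)
    # write the new rows at the end
    f['data_X'][-n_new:] = X_new
    f['data_y'][-n_new:] = y_new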

I am using PyTorch, and I set up my data loader as follows:

import h5py
import torch
import torch.utils.data as Data

class H5Dataset_all(torch.utils.data.Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._h5_gen = None

    def __getitem__(self, index):
        # lazily create the generator that keeps the file handle open
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)
        return self._h5_gen.send(index)

    def _get_generator(self):
        # keep the file open for the lifetime of the generator and
        # yield one (X, y) pair per requested index
        with h5py.File(self.h5_path, 'r') as record:
            index = yield
            while True:
                X = record['data_X'][index]
                y = record['data_y'][index]
                index = yield X, y

    def __len__(self):
        with h5py.File(self.h5_path, 'r') as record:
            return record['data_X'].shape[0]

loader = Data.DataLoader(
    dataset=H5Dataset_all(filename),
    batch_size=BATCH_SIZE,
    shuffle=True, num_workers=0)

Having saved the same data with each of these methods, I would expect them to have similar running times; however, that is not the case. The data I used has size X.shape=(200722, 4919) and y.shape=(200772, 6). Each file is about 3.6 GB.
I test the running time using:

import time
t0 = time.time()
for i, (X_batch, y_batch) in enumerate(loader):
    # assign a dummy value
    a = 0 
t1 = time.time()-t0
print(f'time: {t1}')

For the first method the running time is 83 s and for the second it is 1216 s, which in my mind doesn’t make sense. Can anyone help me figure out why?
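In case it helps with diagnosing this, the storage layout of the two files can be compared like this (a minimal sketch; fileName and fileName2 are the paths from above):

with h5py.File(fileName, 'r') as f1, h5py.File(fileName2, 'r') as f2:
    # chunks is None for a contiguous dataset, otherwise the chunk shape
    print('static file   :', f1['data_X'].chunks)
    print('resizable file:', f2['data_X'].chunks)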

Additionally, I tried saving/loading the data as a torch file using torch.save and torch.load and passing the tensors to Data.TensorDataset before setting up the loader. This implementation runs significantly faster (about 3.7 s), but has the disadvantage of having to load the files before training, which could quickly run up against my memory limit.
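For reference, that variant looks roughly like this (a sketch of what I mean; fileName_pt is a placeholder path):

# save once
torch.save({'X': torch.from_numpy(X).float(),
            'y': torch.from_numpy(y).float()}, fileName_pt)

# load everything into memory before training
data = torch.load(fileName_pt)
dataset = Data.TensorDataset(data['X'], data['y'])
loader = Data.DataLoader(dataset=dataset, batch_size=BATCH_SIZE,
                         shuffle=True, num_workers=0)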

Is there a better way to train reasonably fast without having to load all of the data into memory before training?


I’m not deeply familiar with HDF5, but could you have a look at this post?
It might explain the slowdown due to the chunking of the resizable dataset.
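If that turns out to be the cause, one thing you could try (just a sketch, not tested on your data; 64 rows per chunk is an arbitrary example value) is to choose the chunk shape explicitly when creating the resizable datasets, so that each chunk holds whole rows and a single-sample read only touches one chunk:

with h5py.File(fileName2, 'w') as f:
    # one chunk = a block of complete rows, instead of the
    # automatically chosen chunk shape
    f.create_dataset('data_X', data=X, dtype='float32',
                     maxshape=(None, 4919), chunks=(64, 4919))
    f.create_dataset('data_y', data=y, dtype='float32',
                     maxshape=(None, 6), chunks=(64, 6))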