What’s the best way to load large hdf5 data?

I have a large HDF5 database, and I have successfully resolved the thread-safety problem by enabling the SWMR (single-writer/multiple-reader) feature of HDF5. However, using multiple workers to load my dataset still does not reach normal speed. Typically, I observe the GPU utilization cyclically rise to 100%, then drop to 1%.
Here is my dataset code (it seems very naive):

import h5py
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):
    """
    Args:
        h5data (HDF5 dataset): HDF5 dataset object
    """
    def __init__(self, h5data):
        self.h5data = h5data

    def __getitem__(self, index):
        return self.h5data[index, ...]

    def __len__(self):
        return len(self.h5data)

Then I use the above code this way:

f = h5py.File('Mydata.h5', 'r', libver='latest', swmr=True)
h5data = f['Data']
dataset = HDF5Dataset(h5data)
train_loader = DataLoader(dataset, batch_size=4, num_workers=4)

How can I solve this problem? What's the best practice for loading large HDF5 datasets in PyTorch? Or should I follow What’s the best way to load large data? and migrate my data to LMDB?


Hi, how did you solve this problem? I have encountered the same issue.
Thank you.


Hi!

@Vandermode how did you enable the SWMR feature of HDF5 and get your code snippet to work? I haven’t been able to hack it for a long, long time… :confused:

Thanks,
Piotr

import atexit
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):
    """
    Args:
        h5data_file (str): path to the HDF5 file
    """
    def __init__(self, h5data_file):
        # Store only the path; each DataLoader worker opens its own handle.
        self.h5data_file = h5data_file

    def __getitem__(self, index):
        return self.h5data[index, ...]

    def __len__(self):
        return len(self.h5data)

    def h5py_worker_init(self):
        # Called once per worker process: open a per-worker file handle in
        # SWMR mode, select the dataset (the 'Data' key from the original
        # post), and make sure the handle is closed when the worker exits.
        self.h5file = h5py.File(self.h5data_file, "r", libver="latest", swmr=True)
        self.h5data = self.h5file["Data"]
        atexit.register(self.cleanup)

    def cleanup(self):
        self.h5file.close()


def worker_init_fn(worker_id):
    # Runs inside each worker process; worker_info.dataset is the
    # worker's own copy of the dataset object.
    worker_info = torch.utils.data.get_worker_info()
    dataset = worker_info.dataset
    dataset.h5py_worker_init()


dataset = HDF5Dataset("Mydata.h5")
train_loader = DataLoader(dataset, batch_size=4, num_workers=4, worker_init_fn=worker_init_fn)

Thanks a lot for your answer. Unfortunately this code does not work for me:

AttributeError: 'HDF5Dataset' object has no attribute 'h5data'

It seems like h5data was not properly initialized. Do you have an idea how to fix this? Many thanks.

Very late reply, as I seldom log in here. h5data is only initialized when a worker is initialized, which means this only works if you are actually using a PyTorch DataLoader as shown at the bottom of the example, by passing in worker_init_fn=worker_init_fn. With num_workers=0, worker_init_fn is never called, so self.h5data never exists and any access raises exactly that AttributeError. Some extra work could make this Dataset more flexible; one possible variant is sketched below.
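
For example, a minimal sketch of a lazier variant (assuming the data lives under the 'Data' key as in the original post; the name LazyHDF5Dataset and the cached _length attribute are illustrative, not from the thread): open the file on first access in whichever process reads it, and cache the length up front so len(dataset) also works in the main process.

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class LazyHDF5Dataset(Dataset):
    """Opens its HDF5 file lazily in whichever process first reads from it,
    so it works with num_workers=0 as well as with multiple workers, and
    needs no worker_init_fn."""
    def __init__(self, h5data_file, key="Data"):
        self.h5data_file = h5data_file
        self.key = key
        self.h5file = None
        self.h5data = None
        # Read the length once with a short-lived handle so that
        # len(dataset) works in the main process before any worker starts.
        with h5py.File(h5data_file, "r", libver="latest", swmr=True) as f:
            self._length = len(f[key])

    def __getitem__(self, index):
        if self.h5data is None:
            # First access in this process: open a per-process handle.
            self.h5file = h5py.File(self.h5data_file, "r", libver="latest", swmr=True)
            self.h5data = self.h5file[self.key]
        return self.h5data[index, ...]

    def __len__(self):
        return self._length

dataset = LazyHDF5Dataset("Mydata.h5")
train_loader = DataLoader(dataset, batch_size=4, num_workers=4)

One caveat: avoid indexing the dataset in the main process before iterating with workers, since a handle opened before the workers fork would then be shared across processes.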