Recommended way to load larger h5 files

I was initially getting an OS B-tree error when reading the file from multiple worker processes, so I followed the advice in this thread here:

and created a Dataset class like this:

import h5py
from torch.utils import data


class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase

    def __getitem__(self, index):
        # Reopen the file on every call so each worker process gets its own handle
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays'][index]
            label = archive[str(self.phase) + '_labels'][index]
            path = archive[str(self.phase) + '_img_paths'][index]
            return datum, label, path

    def __len__(self):
        # The file is also reopened here just to read the dataset length
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays']
            return len(datum)


if __name__ == '__main__':
    train_dataset = Features_Dataset(archive="featuresdata/train.hdf5", phase='train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=128)
    print(len(trainloader))
    # renamed the loop variable so it doesn't shadow the torch.utils.data module
    for i, (datum, label, path) in enumerate(trainloader):
        print(path)

Now I don’t get the error anymore, but loading data is extremely slow, and because of that the 4 GPUs I’m trying to use sit at 0% utilization. I think there should be a fix, or I have written something completely inefficient. I have 150k instances, and the data, labels and paths live in three separate datasets within the H5 file; I’m not sure if that is part of the problem.
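One idea I’m considering is to stop reopening the file on every __getitem__ call and instead open it lazily, once per worker process, keeping the handle around, and to read the length once up front. Here is a rough sketch of that (the class name, the _data helper and the _len attribute are just mine for illustration; I’m not sure this still avoids the original B-tree error):

class Features_Dataset_Lazy(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase
        self._file = None
        # Read the length once here instead of reopening the file in __len__
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as f:
            self._len = len(f[str(self.phase) + '_all_arrays'])

    def _data(self):
        # Open the file lazily; after the DataLoader forks, each worker
        # sees _file as None and opens its own handle exactly once
        if self._file is None:
            self._file = h5py.File(self.archive, 'r', libver='latest', swmr=True)
        return self._file

    def __getitem__(self, index):
        f = self._data()
        datum = f[str(self.phase) + '_all_arrays'][index]
        label = f[str(self.phase) + '_labels'][index]
        path = f[str(self.phase) + '_img_paths'][index]
        return datum, label, path

    def __len__(self):
        return self._len

The DataLoader call would stay exactly the same, only the Dataset class changes. Would that be the recommended approach, or is there a better way?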
