Recommended way to load large h5 files

Hello all, I have a dataset that has to be stored in h5 files. It contains 1000 images with a total size of 100GB. I have two options:

  1. Save each image to its own hdf5 file, so I have 1000 hdf5 files in total. In the dataloader, I will call
class h5_loader(data.Dataset):
    def __init__(self, file_path):
        self.file_list = glob.glob(os.path.join(file_path, '*.h5'))
    def __getitem__(self, index):
        # open one small file per requested sample
        h5_file = h5py.File(self.file_list[index], 'r')
        data = h5_file.get('data')
  2. Save all images into a single hdf5 file, then the dataloader can call
class h5_loader(data.Dataset):
    def __init__(self, file_path):
        # open the combined file once
        h5_file = h5py.File(file_path, 'r')
        self.data = h5_file.get('data')
    def __getitem__(self, index):
        ...

Which option should I use to speed up data loading? The first option opens one hdf5 file per sample in the __getitem__ function, while the second opens a single file (all 1000 images combined) in the __init__ function.

That depends on a few things:

  1. Do you have enough RAM to hold your whole dataset?
  2. How many CPUs do you have, and how many GPUs?
  3. How big is your batch size?
  4. How many workers do you train with?
  5. Do you do any online/offline augmentation?

Thanks. I have 32 GB of RAM with a Core i7 CPU, an 11 GB GPU, and a batch size of 16. I do not use augmentation, and the number of workers is 1.

If your whole dataset fits into your RAM, you could load it from one large file inside your dataset’s __init__. This loads all your data once and makes initialization take a bit longer, but it would be the fastest way during training, since no additional data has to be read from disk.
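
A minimal sketch of that first approach, assuming a single file whose dataset is named 'data' (as in the question) and whose samples are numeric arrays. The class name InMemoryH5Dataset and the key argument are made up for illustration:

```python
import h5py
import numpy as np
import torch
from torch.utils import data


class InMemoryH5Dataset(data.Dataset):
    """Reads one HDF5 dataset fully into RAM when the object is created."""

    def __init__(self, file_path, key="data"):
        # [:] copies the whole dataset into a NumPy array, so the file can
        # be closed right away and is not touched again during training.
        with h5py.File(file_path, "r") as h5_file:
            self.data = h5_file[key][:]

    def __getitem__(self, index):
        # Pure in-memory indexing, no disk access here.
        return torch.from_numpy(self.data[index])

    def __len__(self):
        return len(self.data)
```

This only makes sense when the array really fits in RAM; otherwise the [:] copy in __init__ will exhaust memory.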

If that is not possible, you could increase your number of workers and load the images inside your __getitem__. If I remember correctly, HDF5 datasets are read lazily (indexing a dataset only reads the requested slice from disk), which means you could also use one large HDF5 file. The advantage is that you don’t need that much RAM; on the other hand, you would have to load your data during training, which may be a bottleneck (depending on the size of your images, the performance of your GPU, your CPU and your number of workers).
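
A rough sketch of this lazy variant, again assuming one file with a dataset named 'data' that holds numeric arrays. LazyH5Dataset is a hypothetical name. The file is opened on first access rather than in __init__, so that each DataLoader worker process ends up with its own handle:

```python
import h5py
import torch
from torch.utils import data


class LazyH5Dataset(data.Dataset):
    """Keeps the data on disk and reads one sample per __getitem__ call."""

    def __init__(self, file_path, key="data"):
        self.file_path = file_path
        self.key = key
        self._file = None  # opened lazily, once per process
        # Open briefly just to record how many samples there are.
        with h5py.File(file_path, "r") as h5_file:
            self._len = h5_file[key].shape[0]

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.file_path, "r")
        # Indexing the h5py dataset reads only this sample from disk.
        return torch.from_numpy(self._file[self.key][index])

    def __len__(self):
        return self._len
```

It could then be used as data.DataLoader(LazyH5Dataset(path), batch_size=16, num_workers=4), where the worker count is just an example value to tune.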

Hey!
I’m a first-time responder and pretty new to PyTorch and deep learning. I am also trying to train a CNN (a UNet, to be specific) with image data, and I think I’m running into a problem related to this topic.

I have an h5py database file that is too big to load into memory (~27GB). It has 8,000 samples and each sample has shape (14,257,256). I think it’s worth mentioning that I am creating that database myself, so I can control the number of files and the size of each file.

According to the solution here, I can create 8000 different h5py files and use the __getitem__ method to get a different sample from a different file every time (in that case, does the __len__ method still return 8000?).

I was thinking of combining these 2 approaches by creating 8 files with 1000 samples each. In that case the loader would look like(?):

class h5_loader(data.Dataset):
    def __init__(self):
        self.file_path = '/content/drive/My Drive/Project Dataset/trainFile'
    def __getitem__(self, index):
        i = index // 1000   # integer division: which file holds the sample
        mod = index % 1000  # position of the sample inside that file
        h5_file = h5py.File(self.file_path + str(i) + '.h5', 'r')
        input = h5_file['Input'][mod]
        target = h5_file['Target'][mod]
        return input, target

Would that work?
What would you say is the most “clean” and fastest way to load my data into the model?

I’m working on Google Colab from my laptop with 8GB of RAM (I use 0 workers).

Thanks!

UPDATE

So apparently this is a very BAD idea. I tried to train my model using this option and it was very slow, and I think I figured out why.
The disadvantage of using 8000 files (1 file per sample) is that the __getitem__ method has to open a file every time the dataloader wants a new sample (but each file is relatively small, because it contains only one sample).

With my approach, the dataloader has to open a 1000-sample file just to get 1 sample out of it, and the next time the dataloader wants a new sample it has to open the same file all over again.
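
One way to avoid reopening a file on every call is to keep the handles open between calls. The sketch below assumes the layout described above (several files, each with 'Input' and 'Target' datasets holding array samples, and the same number of samples per file); ShardedH5Dataset and its arguments are hypothetical names. It may not help if the storage itself (for example a mounted network drive) is the bottleneck.

```python
import h5py
import torch
from torch.utils import data


class ShardedH5Dataset(data.Dataset):
    """Maps a global index to (file, local index) and caches open files."""

    def __init__(self, file_paths, samples_per_file=1000):
        self.file_paths = list(file_paths)
        self.samples_per_file = samples_per_file
        self._files = {}  # file number -> open h5py.File, filled lazily

    def _open(self, i):
        # Open each file at most once per process and reuse the handle.
        if i not in self._files:
            self._files[i] = h5py.File(self.file_paths[i], "r")
        return self._files[i]

    def __getitem__(self, index):
        i, mod = divmod(index, self.samples_per_file)
        h5_file = self._open(i)
        inp = torch.from_numpy(h5_file["Input"][mod])
        target = torch.from_numpy(h5_file["Target"][mod])
        return inp, target

    def __len__(self):
        # Total number of samples across all files (8000 for 8 x 1000).
        return len(self.file_paths) * self.samples_per_file
```

Note that __len__ reports the total sample count no matter how the samples are split across files, since that is what the DataLoader uses to generate indices.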

What did you do to resolve this?

I haven’t found an optimal solution; for the time being I settled for a smaller dataset.

At first I tried John’s second option (__getitem__ loads each sample separately), but it was really slow on my computer (about 28 sec to load each sample).

Then I tried to get creative and divided each epoch into 8 parts (of 1000 samples each). After each part I deleted the dataloader object and created a new one with a different dataset, like this:

for j in np.arange(25):
    ######## Starting train epoch #########
    running_loss = 0
    train_loss_vector = []

    for k in range(8):
        DB_T = My_Dataloader_Train(k)
        train_loader_train = data.DataLoader(dataset=DB_T, batch_size=16, num_workers=0)

        net.train()  # setting the model to training mode
        for batch_inx, (specs, masks) in enumerate(train_loader_train):

            specs = specs.to(device).float()
            .
            .

class My_Dataloader_Train(Dataset):
  def __init__(self, i):
    super().__init__()
    self.file_path = '/content/drive/My Drive/Project Dataset/trainFile' + str(i) + '.h5'
    # open the file once and close it after copying both arrays into memory
    with h5py.File(self.file_path, 'r') as h5_file:
      self.input = h5_file['Input'][:]
      self.target = h5_file['Target'][:]
    self.len = 1000

It looked like it was working, but after (I think) 10 epochs Colab crashed with an ‘Unable to open file’ error (which is weird, because it had already succeeded in opening the same file an epoch earlier).

Anyway, I’m still waiting for a better solution.

I was initially getting an OS B-Tree error when using multiple processes. So I followed the advice in this thread here:

And created a dataset class like this:

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase

    def __getitem__(self, index):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays'][index]
            label = archive[str(self.phase) + '_labels'][index]
            path = archive[str(self.phase) +  '_img_paths'][index]
            return datum, label, path

    def __len__(self):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays']
            return len(datum)


if __name__ == '__main__':
    train_dataset = Features_Dataset(archive= "featuresdata/train.hdf5", phase= 'train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=128)
    print(len(trainloader))
    for i, (datum, label, path) in enumerate(trainloader):  # 'datum' avoids shadowing the 'data' module
        print(path)

Now I don’t get an error anymore, but loading data is super slow. Because of that, the 4 GPUs that I’m trying to utilize sit at 0% utilization. I think there should be a fix, or I have written something completely inefficient. I have 150k instances, where the data, labels and paths are in 3 different datasets within the H5 file. I’m not sure if that is part of the problem.
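
One thing that might be worth trying here (a sketch, not a verified fix) is to read a whole batch as one contiguous slice instead of opening the file and reading a single row 128 times per batch. The version below reuses the dataset names from the class above and assumes the arrays and labels are numeric; the string paths are left out for brevity. BatchedH5Dataset is a made-up name. Passing batch_size=None to the DataLoader turns off its automatic batching, so each item returned by the dataset is used as a batch directly.

```python
import h5py
import torch
from torch.utils import data


class BatchedH5Dataset(data.Dataset):
    """Each item is a full batch, read from disk as one contiguous slice."""

    def __init__(self, archive, phase, batch_size=128):
        self.archive = archive
        self.arrays_key = str(phase) + "_all_arrays"
        self.labels_key = str(phase) + "_labels"
        self.batch_size = batch_size
        self._file = None  # opened lazily, once per worker process
        with h5py.File(archive, "r") as f:
            self.n_samples = f[self.arrays_key].shape[0]

    def __len__(self):
        # Number of batches, rounding up for a final partial batch.
        return -(-self.n_samples // self.batch_size)

    def __getitem__(self, batch_index):
        if self._file is None:
            self._file = h5py.File(self.archive, "r")
        start = batch_index * self.batch_size
        stop = min(start + self.batch_size, self.n_samples)
        datum = torch.from_numpy(self._file[self.arrays_key][start:stop])
        label = torch.from_numpy(self._file[self.labels_key][start:stop])
        return datum, label
```

Usage would look like data.DataLoader(BatchedH5Dataset("featuresdata/train.hdf5", "train"), batch_size=None, shuffle=True, num_workers=8). With shuffle=True only the order of the batches is shuffled, not the samples inside each batch, which is a trade-off of this design.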