I am wondering if I can modify __getitem__ in Dataset to accept a list of indices instead of one index at a time, to speed up loading data from disk out of an HDF5 file.
My dataset looks something like this:

import h5py
from torch.utils.data import Dataset

class HDFDataset(Dataset):
    def __init__(self, path):
        self.path = path
        # cache the length once so __len__ does not have to reopen the file
        with h5py.File(self.path, 'r') as hdf:
            self.len = hdf['data'].shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # open the file on each call so the dataset is safe with num_workers > 0
        with h5py.File(self.path, 'r') as hdf:
            X = hdf['data'][idx, :]
        return X
dset = HDFDataset(path)
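As an aside, reopening the file on every __getitem__ call adds overhead of its own. A variant I have considered (a rough, untested sketch; it is not my current code) opens the file lazily and caches the handle, so each DataLoader worker ends up with its own handle:

class HDFDatasetCached(Dataset):  # hypothetical variant
    def __init__(self, path):
        self.path = path
        self._hdf = None  # opened lazily, once per worker process
        with h5py.File(self.path, 'r') as hdf:
            self.len = hdf['data'].shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # first call in each worker opens the file; later calls reuse the handle
        if self._hdf is None:
            self._hdf = h5py.File(self.path, 'r')
        return self._hdf['data'][idx, :]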
I also have a custom batch sampler, built on a small helper that splits a flat list of indices into fixed-size chunks:

import torch

def chunk(indices, chunk_size):
    return torch.split(torch.tensor(indices), chunk_size)
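For example (the trailing chunk is simply shorter when the sizes do not divide evenly):

>>> chunk(list(range(7)), 3)
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6]))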
from torch.utils.data import Sampler

class BatchSampler(Sampler):
    def __init__(self, batch_size, dataset):
        self.batch_size = batch_size
        self.dataset = dataset

    def __iter__(self):
        # some function that builds list_of_list_idx, a list of index lists
        return iter(list_of_list_idx)
Sample output from BatchSampler is shown below. Each list represents one batch, and each value is an index within that batch:
batch_sampler = BatchSampler(batch_size, dataset)
for x in batch_sampler:
    print(x)
[12, 3, 8, 6, 17]
[7, 9, 1, 19, 18]
[13, 4, 2, 5, 14]
[0, 3, 10, 11, 20]
The DataLoader looks like this:
train_dataloader = DataLoader(dset, num_workers=8, batch_sampler=batch_sampler)
This approach works fine for me, but data loading takes time because __getitem__ loads one index at a time from disk: even with a batch_sampler, the DataLoader calls __getitem__ once per index and then collates the results. Since I already know from the BatchSampler which indices belong to each batch, is there a way to load the entire batch at once in the dataset?
For example, if the indices for batch 1 are
batch_idx = [12, 3, 8, 6, 17]
could __getitem__ accept a list of indices rather than a single index? Something like below:

def __getitem__(self, batch_idx):
    # h5py fancy indexing needs the indices in increasing order
    with h5py.File(self.path, 'r') as hdf:
        X = hdf['data'][sorted(batch_idx), :]
    return X
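If I read the DataLoader documentation correctly, passing batch_size=None disables automatic batching, and the DataLoader then hands whatever the sampler yields straight to __getitem__. So something like this (untested sketch) might pass each index list through while keeping num_workers:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    dset,
    sampler=batch_sampler,  # yields whole lists like [12, 3, 8, 6, 17]
    batch_size=None,        # automatic batching disabled: __getitem__
                            # receives each list the sampler yields
    num_workers=8,
)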
Solution 1: since I already know the indices in each batch, I could skip the DataLoader and feed the data to the model directly as tensors; however, I would then not be able to utilize the num_workers parameter of the DataLoader to speed things up.
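Concretely, the manual version I have in mind is something like this rough sketch (model is just a placeholder name here):

with h5py.File(path, 'r') as hdf:
    data = hdf['data']
    for batch_idx in batch_sampler:
        X = torch.as_tensor(data[sorted(batch_idx), :])
        out = model(X)  # single process, so no num_workers speedup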
If there is a way to load data in chunks using Dataset & DataLoader, it would solve my issue.
Appreciate any suggestions.