I am wondering if I can modify `__getitem__` in `Dataset` to accept multiple indices instead of one index at a time, to improve data-loading speed from disk with an HDF5 file.
My dataset looks something like this:

```python
import h5py
from torch.utils.data import Dataset

class HDFDataset(Dataset):
    def __init__(self, path):
        self.path = path
        # read the dataset length once up front
        with h5py.File(self.path, 'r') as hdf:
            self.len = hdf['data'].shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        with h5py.File(self.path, 'r') as hdf:
            X = hdf['data'][idx, :]
        return X
```
```python
dset = HDFDataset(path)
```
I also have a custom batch sampler:

```python
import torch
from torch.utils.data import Sampler

def chunk(indices, chunk_size):
    return torch.split(torch.tensor(indices), chunk_size)

class BatchSampler(Sampler):
    def __init__(self, batch_size, dataset):
        self.batch_size = batch_size
        self.dataset = dataset

    def __iter__(self):
        # some function
        return iter(list_of_list_idx)
```
Sample output from BatchSampler is shown below. Each list represents one batch, and the values are the indices in that batch:
```python
batch_sampler = BatchSampler(batch_size, dataset)
for x in batch_sampler:
    print(x)
```

```
[12, 3, 8, 6, 17]
[7, 9, 1, 19, 18]
[13, 4, 2, 5, 14]
[0, 3, 10, 11, 20]
```
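For context, a minimal sketch of how such a sampler could produce those batches (the actual `#some function` is omitted above, so this `RandomBatchSampler` is hypothetical: it just shuffles the indices and splits them with the `chunk` helper):

```python
import torch
from torch.utils.data import Sampler

def chunk(indices, chunk_size):
    # split a flat index list into tensors of at most chunk_size elements
    return torch.split(torch.tensor(indices), chunk_size)

class RandomBatchSampler(Sampler):
    """Hypothetical sketch: shuffle all indices, then chunk into batches."""
    def __init__(self, batch_size, dataset_len):
        self.batch_size = batch_size
        self.dataset_len = dataset_len

    def __iter__(self):
        indices = torch.randperm(self.dataset_len).tolist()
        for batch in chunk(indices, self.batch_size):
            yield batch.tolist()

    def __len__(self):
        # number of batches, rounding up for the final partial batch
        return (self.dataset_len + self.batch_size - 1) // self.batch_size

sampler = RandomBatchSampler(5, 21)
batches = list(sampler)  # e.g. [[12, 3, 8, 6, 17], ...], last batch shorter
```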
The DataLoader looks like:

```python
train_dataloader = DataLoader(dset, num_workers=8, batch_sampler=batch_sampler)
```
This approach works fine for me, but data loading is slow because `__getitem__` reads one index at a time from disk.
Since I already know the indices I need for each batch from the BatchSampler, is there a way to load the entire batch at once inside the Dataset?
For example, if the indices of batch 1 are

```python
batch_idx = [12, 3, 8, 6, 17]
```
then `__getitem__` could accept a list of indices instead of a single index, something like:

```python
def __getitem__(self, batch_idx):
    with h5py.File(self.path, 'r') as hdf:
        X = hdf['data'][batch_idx, :]
    return X
```
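One caveat with that idea: h5py's fancy indexing requires the index list to be in increasing order, so an unsorted batch like `[12, 3, 8, 6, 17]` would raise an error. A batched read therefore has to sort the indices first and restore the original order afterwards. A small sketch of that sort/unsort step, using a NumPy array as a stand-in for `hdf['data']`:

```python
import numpy as np

def read_batch(data, batch_idx):
    # h5py fancy indexing needs increasing indices, so sort first
    order = np.argsort(batch_idx)            # positions that sort batch_idx
    sorted_idx = np.asarray(batch_idx)[order]
    rows = data[sorted_idx, :]               # one read with sorted indices
    inverse = np.argsort(order)              # permutation that undoes the sort
    return rows[inverse]                     # rows back in original batch order

data = np.arange(40).reshape(20, 2)          # stand-in for the HDF5 dataset
batch = read_batch(data, [12, 3, 8, 6, 17])  # rows 12, 3, 8, 6, 17 in order
```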
Solution 1: since I already know the indices in each batch, I could just load the data into the model as tensors myself; however, I then would not be able to use the `num_workers` parameter of DataLoader to speed things up.
If there is a way to load data in chunks using Dataset & DataLoader, it would solve my issue.
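One possible way to get a whole batch into `__getitem__` (a sketch, not necessarily the only approach): pass the batch sampler as `sampler=` with `batch_size=None`. Setting `batch_size=None` disables the DataLoader's automatic batching, so each element the sampler yields (here, a full list of indices) is handed directly to `__getitem__`, and `num_workers` still applies. A toy in-memory Dataset stands in for the HDF5-backed one:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class BatchedDataset(Dataset):
    """Sketch: __getitem__ receives a whole list of indices at once."""
    def __init__(self, array):
        self.array = array  # stand-in for the HDF5 data

    def __len__(self):
        return len(self.array)

    def __getitem__(self, batch_idx):
        # batch_idx is a list of indices; return the whole batch in one read
        return self.array[batch_idx]

dset = BatchedDataset(torch.arange(40.).reshape(20, 2))
batch_lists = [[12, 3, 8, 6, 17], [7, 9, 1, 19, 18]]  # from the batch sampler

# batch_size=None disables automatic batching: each list yielded by the
# sampler is passed straight to __getitem__ as one index.
loader = DataLoader(dset, sampler=batch_lists, batch_size=None)
shapes = [x.shape for x in loader]  # each batch comes back as a (5, 2) tensor
```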
Appreciate any suggestions.