I am wondering if I can modify __getitem__ in Dataset to accept a list of indices instead of one index at a time, to speed up loading data from disk out of an HDF5 file.
My dataset looks something like this:

import h5py
from torch.utils.data import Dataset

class HDFDataset(Dataset):
    def __init__(self, path):
        self.path = path
        # cache the length once so __len__ does not have to reopen the file
        with h5py.File(self.path, 'r') as hdf:
            self.len = hdf['data'].shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # open the file on each call so the dataset is safe with num_workers > 0
        with h5py.File(self.path, 'r') as hdf:
            X = hdf['data'][idx, :]
        return X
dset = HDFDataset(path)
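As an aside, reopening the file on every __getitem__ call adds overhead of its own. A variant I have considered (a rough, untested sketch; it is not my current code) opens the file lazily and caches the handle, so each DataLoader worker ends up with its own handle:

class HDFDatasetCached(Dataset):  # hypothetical variant
    def __init__(self, path):
        self.path = path
        self._hdf = None  # opened lazily, once per worker process
        with h5py.File(self.path, 'r') as hdf:
            self.len = hdf['data'].shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # first call in each worker opens the file; later calls reuse the handle
        if self._hdf is None:
            self._hdf = h5py.File(self.path, 'r')
        return self._hdf['data'][idx, :]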
I also have a custom batch sampler, built on a small helper that splits a flat list of indices into fixed-size chunks:

import torch

def chunk(indices, chunk_size):
    return torch.split(torch.tensor(indices), chunk_size)
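For example (the trailing chunk is simply shorter when the sizes do not divide evenly):

>>> chunk(list(range(7)), 3)
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6]))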
from torch.utils.data import Sampler

class BatchSampler(Sampler):
    def __init__(self, batch_size, dataset):
        self.batch_size = batch_size
        self.dataset = dataset

    def __iter__(self):
        # some function that builds list_of_list_idx, a list of index lists
        return iter(list_of_list_idx)
Sample output from BatchSampler is shown below. Each list represents one batch, and each value is an index within that batch:
batch_sampler = BatchSampler(batch_size, dataset)
for x in batch_sampler:
    print(x)
[12, 3, 8, 6, 17]
[7, 9, 1, 19, 18]
[13, 4, 2, 5, 14]
[0, 3, 10, 11, 20]
The DataLoader looks like this:
train_dataloader = DataLoader(dset, num_workers=8, batch_sampler=batch_sampler)
This approach works fine for me, but data loading takes time because __getitem__ loads one index at a time from disk: even with a batch_sampler, the DataLoader calls __getitem__ once per index and then collates the results. Since I already know from the BatchSampler which indices belong to each batch, is there a way to load the entire batch at once in the dataset?
For example, if the indices for batch 1 are
batch_idx = [12, 3, 8, 6, 17]
could __getitem__ accept a list of indices rather than a single index? Something like below:

def __getitem__(self, batch_idx):
    # h5py fancy indexing needs the indices in increasing order
    with h5py.File(self.path, 'r') as hdf:
        X = hdf['data'][sorted(batch_idx), :]
    return X
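If I read the DataLoader documentation correctly, passing batch_size=None disables automatic batching, and the DataLoader then hands whatever the sampler yields straight to __getitem__. So something like this (untested sketch) might pass each index list through while keeping num_workers:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    dset,
    sampler=batch_sampler,  # yields whole lists like [12, 3, 8, 6, 17]
    batch_size=None,        # automatic batching disabled: __getitem__
                            # receives each list the sampler yields
    num_workers=8,
)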
Solution 1: since I already know the indices in each batch, I could skip the DataLoader and feed the data to the model directly as tensors; however, I would then not be able to utilize the num_workers parameter of the DataLoader to speed things up.
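Concretely, the manual version I have in mind is something like this rough sketch (model is just a placeholder name here):

with h5py.File(path, 'r') as hdf:
    data = hdf['data']
    for batch_idx in batch_sampler:
        X = torch.as_tensor(data[sorted(batch_idx), :])
        out = model(X)  # single process, so no num_workers speedup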
If there is a way to load data in chunks using Dataset & DataLoader, it would solve my issue.
Appreciate any suggestions.