I’ve isolated the main time bottleneck in my code, which seems to be the following lines:
# subset_idx = len(train_dataset) - (some_indices)
train_dataset.train_data = train_dataset.train_data[subset_idx, :]
train_dataset.train_labels = train_dataset.train_labels[subset_idx]
Note that train_dataset is MNIST in my case. These two lines are executed X times inside a for loop, and they slow the loop down by approximately 3.5 seconds per iteration!
I’m not using the DataLoader class, so I’m trying to see if there’s a more efficient way of reducing the training dataset at each iteration.
It looks like you are re-assigning the sliced data to your dataset. Could you slice it and store the result in a temporary variable instead, or do you really need the re-assignment? I suspect the repeated re-assignment is what slows down your code.
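A minimal sketch of that idea, assuming subset_idx is the index array from the snippet above (the variable names below are only illustrative):
# Slice into local variables instead of overwriting the dataset attributes;
# the dataset itself is left untouched between iterations.
subset_data = train_dataset.train_data[subset_idx]
subset_labels = train_dataset.train_labels[subset_idx]
# ... train on subset_data / subset_labels for this iteration ...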
I’ll try this in a bit and see if I can get a speedup.
If you want to use a data loader (which efficiently extracts batches and uses multiprocessing), you can use the get_batch function on train_dataset, defined in the code below. Note that this assumes train_labels is returned by the __getitem__(self, index) method of your dataset class.
import numpy as np
from torch.utils import data

class DynamicSampler(data.sampler.Sampler):
    def __init__(self, max_size=100):
        self.next_batch = []
        self.max_size = max_size

    def select_sample(self, indList):
        self.next_batch = indList

    def __iter__(self):
        return iter(self.next_batch)  # yield only the currently selected indices

    def __len__(self):
        return self.max_size

def get_batch(dataset, indices=None, num_workers=2):
    sampler = DynamicSampler(len(indices))
    sampler.select_sample(indices)
    loader = data.DataLoader(dataset, batch_size=len(indices),
                             sampler=sampler, num_workers=num_workers)
    return next(iter(loader))

if __name__ == '__main__':
    indices = np.arange(5, 100)
    batch = get_batch(train_dataset, indices)  # train_dataset: your MNIST dataset
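One nice property of this approach: the underlying dataset is never copied or mutated. The sampler only tells the loader which indices to fetch, so shrinking the training set at each iteration is just a matter of passing a smaller indices array to get_batch.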