Efficient Dataset Indexing

I’ve isolated the main performance bottleneck of my code, which seems to be in the following lines:

# subset_idx = len(train_dataset) - (some_indices)
train_dataset.train_data = train_dataset.train_data[subset_idx, :]
train_dataset.train_labels = train_dataset.train_labels[subset_idx,]

Note that train_dataset is MNIST in my case. These two lines execute X times in a for loop, and they slow the loop down by approximately 3.5 seconds per iteration!

I’m not using the DataLoader class, so I’m trying to see if there’s a more efficient way of reducing the total training dataset at each iteration.

Thanks!

It looks like you are re-assigning the sliced data back to your dataset. Could you slice it and store the result in a temporary variable instead, or do you really need the re-assignment?
I think the re-assignment might be what's slowing down your code.
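A minimal sketch of that idea (variable names here are illustrative, not from your code): keep the full MNIST tensors untouched, maintain only a shrinking index tensor, and slice into temporaries when you need the subset.

```python
import torch

# Stand-ins for train_dataset.train_data / train_dataset.train_labels
full_data = torch.randn(60000, 28, 28)
full_labels = torch.randint(0, 10, (60000,))

# Indices still in play; this is the only thing that shrinks each iteration
keep = torch.arange(len(full_data))

for _ in range(3):
    # Drop (say) the last 1000 indices instead of rewriting the dataset
    keep = keep[:-1000]
    subset_data = full_data[keep]      # temporary holding the current subset
    subset_labels = full_labels[keep]  # no re-assignment back to the dataset
```

The point is that the expensive copy only produces short-lived temporaries, and the underlying dataset tensors are never mutated.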

I’ll try this in a bit and see if I can get a speedup.

Thanks :slight_smile:

If you want to use a data loader (which extracts batches efficiently and uses multiprocessing), you can call the get_batch function defined in the code below on train_dataset.

Note that this assumes your dataset's __getitem__(self, index) method returns a (train_data, train_labels) pair for the given index.
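For reference, a minimal dataset satisfying that contract could look like this (a hypothetical sketch, not the actual MNIST class):

```python
import torch
from torch.utils import data

class ArrayDataset(data.Dataset):
    """Hypothetical minimal dataset: __getitem__ returns a (sample, label) pair."""
    def __init__(self, train_data, train_labels):
        self.train_data = train_data
        self.train_labels = train_labels

    def __getitem__(self, index):
        # Contract assumed by get_batch below: one sample and one label per index
        return self.train_data[index], self.train_labels[index]

    def __len__(self):
        return len(self.train_data)

# Toy instance with MNIST-shaped samples
train_dataset = ArrayDataset(torch.randn(100, 28, 28),
                             torch.randint(0, 10, (100,)))
```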

import numpy as np
from torch.utils import data

class DynamicSampler(object):
    def __init__(self, max_size=100):
        self.next_batch = [0]
        self.max_size = max_size

    def select_sample(self, indList):
        self.next_batch = indList

    def __iter__(self):
        return iter(self.next_batch)

    def __len__(self):
        return self.max_size

def get_batch(dataset, indices, num_workers=2):
    # Tell the sampler which indices to yield before iterating the loader
    sampler = DynamicSampler(len(indices))
    sampler.select_sample(indices)

    loader = data.DataLoader(dataset,
                  batch_size=len(indices),
                  sampler=sampler,
                  num_workers=num_workers)

    return next(iter(loader))

if __name__ == '__main__':
    indices = np.arange(5, 100)
    batch = get_batch(train_dataset, indices)