Removing datapoints from dataset

I would like to remove specific indices from a dataset. I tried this, but it doesn’t work:

import numpy as np
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class MyDataset(Dataset):
    def __init__(self, remove_list):
        self.cifar10 = datasets.CIFAR10(root='./data',
                                        download=False,
                                        train=True,
                                        transform=transforms.ToTensor())
        self.data = self.cifar10.data
        self.targets = self.cifar10.targets
        self.final_data, self.final_targets = self.__remove__(remove_list)

    def __getitem__(self, index):
        data, target = self.final_data[index], self.final_targets[index]
        return data, target, index

    def __len__(self):
        return len(self.final_data)

    def __remove__(self, remove_list):
        data = np.delete(self.data, remove_list)
        targets = np.delete(self.targets, remove_list)
        return data, targets

I realized the issue was with the way I deleted items. It should be:

        data = np.delete(self.data, remove_list, axis=0)
        targets = np.delete(self.targets, remove_list, axis=0)
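
For context on why the axis argument matters: without it, np.delete works on a flattened copy of the array, so single values are removed and the image structure is lost. A minimal sketch, using a small array as a stand-in for the CIFAR-10 data (the shapes and the remove_list values are made up for illustration):

```python
import numpy as np

# Stand-in for self.cifar10.data: 6 "images" of shape 4x4x3 (hypothetical sizes).
data = np.arange(6 * 4 * 4 * 3).reshape(6, 4, 4, 3)
remove_list = [1, 4]

# Without axis, np.delete flattens the input and removes single values.
flat = np.delete(data, remove_list)
print(flat.shape)   # (286,)

# With axis=0, whole images are removed along the first dimension.
kept = np.delete(data, remove_list, axis=0)
print(kept.shape)   # (4, 4, 4, 3)
```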

But is this doing the correct thing overall, i.e. removing specific images based on their index? Or is the index different every time the dataset is loaded?

The data should be loaded in the same order, but of course you could verify it by comparing some random data samples.
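
One way to run that check, as a rough sketch: compare a few randomly chosen samples between two loads of the dataset. The helper name same_order and the synthetic arrays below are assumptions for illustration. With the real dataset you would pass the .data arrays of two separately created datasets.CIFAR10 objects instead.

```python
import numpy as np

def same_order(data_a, data_b, n_samples=10, seed=0):
    """Spot-check that two arrays hold identical samples at random indices."""
    if len(data_a) != len(data_b):
        return False
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data_a), size=min(n_samples, len(data_a)), replace=False)
    return all(np.array_equal(data_a[i], data_b[i]) for i in idx)

# Synthetic stand-ins for two loads of the same dataset.
first_load = np.random.default_rng(42).integers(0, 256, size=(100, 32, 32, 3))
second_load = first_load.copy()

print(same_order(first_load, second_load))   # True
# A reversed copy places different samples at each index.
reversed_result = same_order(first_load, first_load[::-1])
```

A spot check like this only samples a few indices, so it can miss a mismatch elsewhere. Comparing the full arrays with np.array_equal is the stricter option.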

np.delete should work fine on numpy arrays. Alternatively, you could also slice the arrays by creating a mask array and setting the values at remove_list to False:

mask = np.ones(len(self.data), dtype=bool)
mask[remove_list] = False
data = self.data[mask]
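
As a quick sanity check that the mask approach and np.delete with axis=0 give the same result, here is a small sketch on a made-up array (the shape and remove_list are illustrative, not taken from CIFAR-10):

```python
import numpy as np

data = np.arange(5 * 2 * 2 * 3).reshape(5, 2, 2, 3)  # 5 tiny stand-in "images"
remove_list = [0, 3]

# Boolean mask: True means "keep this entry".
mask = np.ones(len(data), dtype=bool)
mask[remove_list] = False

via_mask = data[mask]
via_delete = np.delete(data, remove_list, axis=0)

print(np.array_equal(via_mask, via_delete))  # True
print(via_mask.shape)                        # (3, 2, 2, 3)
```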

This is so helpful, thank you so much!

Hi ptrblck,
could you explain in more detail what the last two lines of code do?

The code snippet initializes a mask with True values for all entries first.
The second line of code then uses the remove_list indices to index mask and sets these values to False. In the last line of code self.data is indexed with mask and reassigned to data which will then contain all entries from self.data where mask was set to True.
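
The three steps described above can be traced on a tiny 1-D array. The values here are invented purely to make the intermediate mask visible:

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])
remove_list = [1, 3]

# Step 1: start with a mask that keeps everything.
mask = np.ones(len(values), dtype=bool)
print(mask.tolist())          # [True, True, True, True, True]

# Step 2: mark the positions in remove_list as "drop".
mask[remove_list] = False
print(mask.tolist())          # [True, False, True, False, True]

# Step 3: boolean indexing keeps only the entries where mask is True.
print(values[mask].tolist())  # [10, 30, 50]
```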

Thank you, very clear! So (correct me if I’m wrong) you also need a further line to do the same on the targets, right? Something like:

targets = self.targets[mask]

Yes, if you are working with a target tensor and want to remove the same indices you would have to add your line of code.
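
One caveat, stated as my understanding of torchvision rather than something from this thread: CIFAR10 appears to store .targets as a plain Python list, and a list cannot be indexed with a boolean array. If that applies to your setup, converting to a NumPy array first should work. A sketch with a made-up label list:

```python
import numpy as np

targets_list = [3, 8, 0, 6, 1]   # stand-in for self.cifar10.targets
remove_list = [1, 3]

mask = np.ones(len(targets_list), dtype=bool)
mask[remove_list] = False

# targets_list[mask] would raise a TypeError, so convert to an array first.
targets = np.asarray(targets_list)[mask]
print(targets.tolist())  # [3, 0, 1]
```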
