Remove indices from Dataset

abbab · November 19, 2021, 1:18pm

I have two datasets, supervised_data and validation_data, which I used in a previous training run.

I want to exclude the indices of validation_data from supervised_data.

I tried torch.utils.data.Subset(supervised_data, validation_data.indices), but this selects only the validation indices that exist in supervised_data.

How can I get a subset of supervised_data that doesn’t contain the samples in validation_data?

ptrblck · November 20, 2021, 9:02am

Could you explain how these datasets were created?
Both datasets will use their own indices in the range [0, len(dataset)-1].
If both datasets are also using the same samples in the same order internally, and assuming that supervised_data contains more samples than validation_data, then you could use a Subset with indices = torch.arange(len(validation_data), len(supervised_data)).
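A minimal sketch of that idea, using a toy TensorDataset as a stand-in for the real data. It assumes that the validation samples are exactly the first len(validation_data) samples of supervised_data, in the same order; the names full, remaining and kept are only for illustration.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-ins: 10 samples in total, of which the first 3 are assumed
# to be the validation samples, stored in the same order.
full = TensorDataset(torch.arange(10))
supervised_data = full
validation_data = Subset(full, range(3))

# Keep every index from len(validation_data) up to len(supervised_data) - 1.
indices = torch.arange(len(validation_data), len(supervised_data))
remaining = Subset(supervised_data, indices.tolist())

# Collect the sample values that are left, to check the result.
kept = [remaining[i][0].item() for i in range(len(remaining))]
```

If the validation samples are not the leading block of supervised_data, this slice would keep the wrong samples.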

However, if the aforementioned conditions are not met, you might need to create a mapping between the samples of both datasets or, probably better, split them during their creation in a clean way.
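One way such a mapping could look, assuming every sample has a unique identifier such as a file name. The helper indices_not_in and the file names below are hypothetical and only illustrate the idea.

```python
def indices_not_in(all_names, excluded_names):
    """Return the positions in all_names whose name is not in excluded_names."""
    excluded = set(excluded_names)
    return [i for i, name in enumerate(all_names) if name not in excluded]


# Hypothetical identifiers standing in for the samples of each dataset.
supervised_names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
validation_names = ["d.jpg", "b.jpg"]

keep_idx = indices_not_in(supervised_names, validation_names)
# keep_idx could then be passed to torch.utils.data.Subset(supervised_data, keep_idx).
```

Matching on an identifier does not depend on the order of the samples, so it would still work if the two datasets were built or shuffled differently.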

abbab · November 22, 2021, 10:54am

Thank you for your response.

This is the code for the dataset:

import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class SLD_Labeled(Dataset):

    def __init__(self):
        self.root = "/home/ubuntu/workdir/data/real_stuff/"
        self.image_dir = os.listdir(self.root + 'images/')
        self.bands = ['B02', 'B03', 'B04']

    def __len__(self):
        return len(self.image_dir)

    def __getitem__(self, index):
        name = self.image_dir[index]
        subf = os.path.join(self.root, 'images', name)
        label_file = os.path.join(self.root, f'labels/{name}').replace('.jpg', '.tif')

        image = np.array(Image.open(subf).convert('RGB'))
        label = np.array(Image.open(label_file).convert('P'))
        # Paths containing ':' belong to original images and keep their labels.
        if ':' not in label_file:
            # Generated image: binarize the label (values < 128 -> 0, otherwise 1).
            label = np.where(label < 128, 0, label)
            label = np.where(label > 127, 1, label)

        # Reorder the axes from (H, W, C) to (C, W, H).
        data_tensor = torch.from_numpy(np.transpose(image, (2, 1, 0))).float()
        label_tensor = torch.from_numpy(label).long()

        return data_tensor, label_tensor, str(name)

The same dataset was used for both validation_data and supervised_data.
I am trying to remove the validation_data indices from supervised_data so that I have a subset that doesn’t contain the validation samples.

ptrblck · November 22, 2021, 8:26pm

I don’t see any indices used in the Dataset definition, so I assume that you created supervised_data and validation_data manually beforehand.
If so, I think the easiest approach would be to split the indices before creating the Subsets, as shown here:

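import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# dataset is your Dataset instance
nb_samples = 1000 # set to your value, e.g. len(dataset)
indices = np.arange(nb_samples)
train_idx, val_idx = train_test_split(indices, train_size=0.8)

train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)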
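If the validation Subset from the earlier run is still available, the complement of its indices might be enough. This is a rough sketch with a toy TensorDataset as a stand-in. It assumes that validation_data is a Subset of the very same dataset object and that the sample order has not changed since the split was made. Note that os.listdir returns entries in arbitrary order, so sorting the file list would be needed to keep indices stable across runs.

```python
import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.arange(10))     # toy stand-in for the real dataset
validation_data = Subset(dataset, [1, 4, 7])  # assumed existing validation split

# Every index of the full dataset that is not a validation index.
val_idx = set(int(i) for i in validation_data.indices)
train_idx = [i for i in range(len(dataset)) if i not in val_idx]

supervised_only = Subset(dataset, train_idx)
```

Converting the validation indices to a set keeps the membership check cheap for large datasets.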