I have two datasets, supervised_data and validation_data, which I used in a previous training run.
I want to exclude the indices of validation_data from supervised_data.
I tried torch.utils.data.Subset(supervised_data, validation_data.indices), but this selects only the validation indices that exist in supervised_data.
How can I get the subset of supervised_data that doesn’t exist in validation_data?
Could you explain how these datasets were created?
Both datasets will use their own indices in the range [0, len(dataset)-1].
If both datasets also use the same samples in the same order internally, and assuming that supervised_data contains more samples than validation_data, then you could use a Subset with indices = torch.arange(len(validation_data), len(supervised_data)).
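As a rough sketch of that suggestion: the toy TensorDataset objects and the name remaining below are made up for illustration, and the example only holds under the assumption stated above, i.e. validation_data corresponds to the first len(validation_data) samples of supervised_data in the same order.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-ins (assumption): both datasets hold the same samples in the
# same order, and validation_data covers the first 20 of them.
data = torch.arange(100).float().unsqueeze(1)
supervised_data = TensorDataset(data)
validation_data = TensorDataset(data[:20])

# Keep only the supervised samples that come after the validation range.
indices = torch.arange(len(validation_data), len(supervised_data))
remaining = Subset(supervised_data, indices.tolist())

print(len(remaining))
```

If the validation samples are not a contiguous prefix of supervised_data, this slicing would silently keep some of them, so it is worth checking that assumption first.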
However, if the aforementioned conditions are not met, you might need to create a mapping between the samples of both datasets or, probably better, split them during their creation in a clean way.
The same dataset was used for both validation_data and supervised_data.
I am trying to remove the validation_data indices that exist in supervised_data, so I can have a subset that doesn’t contain the validation samples.
I don’t see any indices used in the Dataset definition and thus assume that you created supervised_data and validation_data manually beforehand.
If so, I think the easiest approach would be to split the indices before creating the Subsets as seen here:
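import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# `dataset` is your original Dataset, from which both splits are drawn
nb_samples = 1000 # set to your value, e.g. len(dataset)
indices = np.arange(nb_samples)
train_idx, val_idx = train_test_split(indices, train_size=0.8)
train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)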
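If the Subsets already exist and re-splitting is not an option, one possible alternative is to take the set difference of their indices. This is only a sketch and assumes that supervised_data and validation_data are both torch.utils.data.Subset objects wrapping the same parent dataset, so that their .indices refer to the same positions. The toy dataset and the name supervised_only are made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-ins (assumption): one parent dataset and two overlapping Subsets of it.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))
supervised_data = Subset(dataset, list(range(10)))
validation_data = Subset(dataset, [1, 4, 7])

# Parent-dataset indices that are in supervised_data but not in validation_data.
keep_idx = np.setdiff1d(supervised_data.indices, validation_data.indices)

# Index the parent dataset directly, since keep_idx refers to its positions.
supervised_only = Subset(dataset, keep_idx.tolist())

print(len(supervised_only))
```

Note that np.setdiff1d returns sorted, unique indices, so the original ordering of supervised_data.indices is not preserved.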