Transforms according to type of data

Hi all,

I split my custom dataset into training, validation, and test sets as follows:

import torch

# load the full dataset, then split it 80/20 into train and (val + test)
total_dataset = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)
train_size = int(0.8 * len(total_dataset))
val_test_size = len(total_dataset) - train_size
train_dataset, val_test_dataset = torch.utils.data.random_split(total_dataset, [train_size, val_test_size])

# split the remaining 20% into 90% validation and 10% test
valid_size = int(0.9 * len(val_test_dataset))
test_size = len(val_test_dataset) - valid_size
val_dataset, test_dataset = torch.utils.data.random_split(val_test_dataset, [valid_size, test_size])

In this case, I would like to inject noise into the training dataset only. How can I do that? Since the split happens after the single dataset is created, I can’t pass the type of data, e.g. data=“train” or data=“valid”, as is normally done…

Thanks.
BR,
Shweta.

You could add the noise in the training loop (outside of the Dataset); see the sketch below.
Alternatively you could create a Dataset instance for each split (and add the noise to the training dataset), create the split indices, and wrap the datasets together with the corresponding split indices in Subsets.
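
For the first approach, a minimal sketch could look like this (train_loader, model, criterion, optimizer, and the noise level are placeholders for your own objects and values):

import torch

noise_std = 0.01  # assumed noise level; tune it for your data

for data, target in train_loader:
    # add Gaussian noise to the training inputs only
    data = data + torch.randn_like(data) * noise_std
    output = model(data)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()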

Alternatively you could create a Dataset instance for each split (and add the noise to the training dataset), create the split indices, and wrap the datasets together with the corresponding split indices in Subsets.

This is what I am looking for, but I don’t know how to do this… :(

Here is a small dummy example:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# one Dataset instance per split; only the training one gets the noise
dataset_train = MyDataset(noise=...)
dataset_val = MyDataset()
dataset_test = MyDataset()

# split the indices: 80% train, then the rest 50/50 into val and test
idx = np.arange(len(dataset_train))
train_idx, val_idx = train_test_split(idx, train_size=0.8)
val_idx, test_idx = train_test_split(val_idx, train_size=0.5)

# wrap each dataset with its own split indices
dataset_train = Subset(dataset_train, train_idx)
dataset_val = Subset(dataset_val, val_idx)
dataset_test = Subset(dataset_test, test_idx)
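
The resulting Subsets can then be passed to DataLoaders as usual (the batch size here is just an example):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset_train, batch_size=32, shuffle=True)
val_loader = DataLoader(dataset_val, batch_size=32)
test_loader = DataLoader(dataset_test, batch_size=32)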

I am really sorry, I don’t get it.
So, I get the whole dataset and then split it.

total_dataset = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)

Are you suggesting to create 3 different datasets before the split? Like, let’s say, for train_set I take x rows, for valid y rows, and for test z rows.

But if I do so, I don’t understand why I would need to do the split again?

No, you should create the “same” dataset three times (for the training dataset, you should add the noise argument, if available).
Each dataset is then passed to Subset with the corresponding indices, as in the sketch below.

If you are lazily loading the data, you won’t see any performance penalties using this approach.
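
Applied to your SpanishSpeechDataSet and split ratios, it could look roughly like this; the noise argument is an assumption, since I don’t know your dataset’s constructor:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# three instances of the same dataset; only the training one gets the noise
dataset_train = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir, noise=True)  # hypothetical noise argument
dataset_val = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)
dataset_test = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)

# 80% train, then the remaining 20% split 90/10 into val and test
idx = np.arange(len(dataset_train))
train_idx, val_test_idx = train_test_split(idx, train_size=0.8)
val_idx, test_idx = train_test_split(val_test_idx, train_size=0.9)

train_dataset = Subset(dataset_train, train_idx)
val_dataset = Subset(dataset_val, val_idx)
test_dataset = Subset(dataset_test, test_idx)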

Ok, thank you.

I didn’t understand this approach before, but now I understand it better. Thank you.