Transforms according to type of data

Hi all,

I split my custom dataset into training, validation, and test sets as follows:

import torch

# load the full dataset, then split it 80/20 into train and (val + test)
total_dataset = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)
train_size = int(0.8 * len(total_dataset))
val_test_size = len(total_dataset) - train_size
train_dataset, val_test_dataset = torch.utils.data.random_split(total_dataset, [train_size, val_test_size])

# split the remaining 20% into 90% validation and 10% test
valid_size = int(0.9 * len(val_test_dataset))
test_size = len(val_test_dataset) - valid_size
val_dataset, test_dataset = torch.utils.data.random_split(val_test_dataset, [valid_size, test_size])

In this case, I would like to inject noise into the training dataset only. How can I do that? Since the split happens after the single dataset is created, I can’t pass the type of data, e.g. data=“train” or data=“valid”, as is normally done…

Thanks.
BR,
Shweta.

You could add the noise in the training loop (outside of the Dataset); see the sketch below.
Alternatively you could create a Dataset instance for each split (and add the noise to the training dataset), create the split indices, and wrap the datasets together with the corresponding split indices in Subsets.
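
For the first approach, a minimal sketch could look like this (train_loader, model, criterion, optimizer, and the noise level are placeholders for your own objects and values):

import torch

noise_std = 0.01  # assumed noise level; tune it for your data

for data, target in train_loader:
    # add Gaussian noise to the training inputs only
    data = data + torch.randn_like(data) * noise_std
    output = model(data)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()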

Alternatively you could create a Dataset instance for each split (and add the noise to the training dataset), create the split indices, and wrap the datasets together with the corresponding split indices in Subsets.

This is what I am looking for, but I don’t know how to do this… :(

Here is a small dummy example:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# one Dataset instance per split; only the training one gets the noise
dataset_train = MyDataset(noise=...)
dataset_val = MyDataset()
dataset_test = MyDataset()

# split the indices: 80% train, then the rest 50/50 into val and test
idx = np.arange(len(dataset_train))
train_idx, val_idx = train_test_split(idx, train_size=0.8)
val_idx, test_idx = train_test_split(val_idx, train_size=0.5)

# wrap each dataset with its own split indices
dataset_train = Subset(dataset_train, train_idx)
dataset_val = Subset(dataset_val, val_idx)
dataset_test = Subset(dataset_test, test_idx)
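
The resulting Subsets can then be passed to DataLoaders as usual (the batch size here is just an example):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset_train, batch_size=32, shuffle=True)
val_loader = DataLoader(dataset_val, batch_size=32)
test_loader = DataLoader(dataset_test, batch_size=32)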

I am really sorry, I don’t get it.
So, I get the whole dataset and then split it.

total_dataset = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)

Are you suggesting to create 3 different datasets before the split? Like, let’s say, for train_set I take x rows, for valid y rows, and for test z rows.

But if I do so, I don’t understand why I would need to do the split again?

No, you should create the “same” dataset three times (for the training dataset, you should add the noise argument, if available).
Each dataset is then passed to Subset with the corresponding indices, as in the sketch below.

If you are lazily loading the data, you won’t see any performance penalties using this approach.
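
Applied to your SpanishSpeechDataSet and split ratios, it could look roughly like this; the noise argument is an assumption, since I don’t know your dataset’s constructor:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# three instances of the same dataset; only the training one gets the noise
dataset_train = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir, noise=True)  # hypothetical noise argument
dataset_val = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)
dataset_test = speechdataset.SpanishSpeechDataSet(csv_files=csv_files, root_dir=root_dir)

# 80% train, then the remaining 20% split 90/10 into val and test
idx = np.arange(len(dataset_train))
train_idx, val_test_idx = train_test_split(idx, train_size=0.8)
val_idx, test_idx = train_test_split(val_test_idx, train_size=0.9)

train_dataset = Subset(dataset_train, train_idx)
val_dataset = Subset(dataset_val, val_idx)
test_dataset = Subset(dataset_test, test_idx)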

Ok, thank you.

I didn’t understand this approach before, but now I understand it better. Thank you.