I have a Dataset class that loads two datasets from their respective folders (train and test).
I would like to create a validation set from the training set. For this I am using the random_split function.
This results in two Subset datasets: train_dataset and valid_dataset. For normalization I would like to calculate the mean and std (or min/max) of the training set, but I cannot simply call mean() on it, because it is a Subset.
What I can do is calculate the mean, std, etc. on the whole dataset, but then the statistics used to normalize the validation set would also include values from the validation set itself. Is there an efficient way to calculate statistics of a Subset? Or do I need to loop over it?
This is my code, which calculates the “wrong” mean and std (over the whole dataset):
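import numpy as np
import torch
from torch.utils.data import random_split
# transportation_dataset, data_folder and valid_size are defined elsewhere in my code
dataset = transportation_dataset(data_path=data_folder, train=True)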
# Split the data into training and validation set
num_train = len(dataset)
split_valid = int(np.floor(valid_size * num_train))
split_train = num_train - split_valid
train_dataset, valid_dataset = random_split(dataset, [split_train, split_valid])
# Test dataset
test_dataset = transportation_dataset(data_path=data_folder, train=False)
# TODO normalize dataset (using scaler trained on training set)
# get mean and std of trainset (for every feature)
# Just loop over dataset?
mean_train, std_train = torch.mean(train_dataset.dataset.data, dim=0), torch.std(train_dataset.dataset.data, dim=0)
# The mean and std above are the same as the following (which should not be the case)
mean_train, std_train = torch.mean(dataset.data, dim=0), torch.std(dataset.data, dim=0)
train_dataset.dataset.data = (train_dataset.dataset.data - mean_train) / std_train
valid_dataset.dataset.data = (valid_dataset.dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
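As far as I understand it, the two computations agree because a Subset only stores a reference to the parent dataset plus a list of indices. A minimal sketch of that behaviour, using a hypothetical ToyDataset as a stand-in for transportation_dataset (the class name, the sizes and the random data are assumptions for illustration only):

```python
import torch
from torch.utils.data import Dataset, random_split


class ToyDataset(Dataset):
    """Hypothetical stand-in for transportation_dataset: features live in .data."""

    def __init__(self, n_samples=100, n_features=4):
        generator = torch.Generator().manual_seed(0)
        self.data = torch.randn(n_samples, n_features, generator=generator)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


dataset = ToyDataset()
train_dataset, valid_dataset = random_split(dataset, [80, 20])

# Both subsets wrap the very same parent object ...
same_parent = train_dataset.dataset is dataset and valid_dataset.dataset is dataset
# ... and differ only in the list of indices they expose.
n_train_indices = len(train_dataset.indices)
n_valid_indices = len(valid_dataset.indices)
print(same_parent, n_train_indices, n_valid_indices)
```

If this is right, it also means that train_dataset.dataset.data and valid_dataset.dataset.data are the same tensor, so the two assignments in my code above would normalize the full data twice.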
I expected to be able to do the following for calculating the mean and std for the Subsets:
mean_train, std_train = torch.mean(train_dataset.data, dim=0), torch.std(train_dataset.data, dim=0)
train_dataset.data = (train_dataset.data - mean_train) / std_train
valid_dataset.data = (valid_dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
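One possible workaround, sketched under the assumption that the parent dataset keeps all features in a single tensor attribute called data, is to index that tensor with the indices stored on the Subset and compute the statistics on the selected rows only. ToyDataset below is again a hypothetical stand-in for my real class, not the actual implementation:

```python
import torch
from torch.utils.data import Dataset, random_split


class ToyDataset(Dataset):
    """Hypothetical stand-in for transportation_dataset: features live in .data."""

    def __init__(self, n_samples=100, n_features=4):
        generator = torch.Generator().manual_seed(0)
        self.data = torch.randn(n_samples, n_features, generator=generator)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


dataset = ToyDataset()
train_dataset, valid_dataset = random_split(dataset, [80, 20])

# Select only the training rows of the parent tensor via the Subset's indices.
train_data = dataset.data[train_dataset.indices]
mean_train = train_data.mean(dim=0)
std_train = train_data.std(dim=0)

# Normalize the parent tensor once; both subsets read from it,
# so training and validation samples are scaled with training statistics.
dataset.data = (dataset.data - mean_train) / std_train
```

The test set would then be normalized separately with the same mean_train and std_train. If the data does not fit into one tensor, looping over the Subset (for example with a DataLoader) and accumulating the statistics is probably the fallback.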