I have a Dataset class that loads two datasets from their respective folders (train and test).
I would like to create a validation set from the training set. For this I am using the random_split
function.
This results in two Subset datasets: train_dataset and valid_dataset. For normalization I would like to calculate the mean and std (or min/max) of the training set, but I cannot simply call mean() on it, because it is a Subset.
What I can do is calculate the mean, std, etc. on the whole dataset, but then the normalization statistics also include values from the validation set, which leaks validation data into the training pipeline. Is there an efficient way to calculate statistics of a Subset? Or do I need to loop over it?
This is my code, which calculates the “wrong” mean and std (over the whole dataset):
import numpy as np
import torch
from torch.utils.data import random_split

dataset = transportation_dataset(data_path=data_folder, train=True)
# Split the data into training and validation set
num_train = len(dataset)
split_valid = int(np.floor(valid_size * num_train))
split_train = num_train - split_valid
train_dataset, valid_dataset = random_split(dataset, [split_train, split_valid])
# Test dataset
test_dataset = transportation_dataset(data_path=data_folder, train=False)
# TODO normalize dataset (using scaler trained on training set)
# get mean and std of trainset (for every feature)
# Just loop over dataset?
mean_train, std_train = torch.mean(train_dataset.dataset.data, dim=0), torch.std(train_dataset.dataset.data, dim=0)
# this gives the same mean and std as computing them over the whole dataset
# (which should not be the case):
mean_train, std_train = torch.mean(dataset.data, dim=0), torch.std(dataset.data, dim=0)
train_dataset.dataset.data = (train_dataset.dataset.data - mean_train) / std_train
valid_dataset.dataset.data = (valid_dataset.dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
I expected to be able to do the following to calculate the mean and std for the Subsets:
mean_train, std_train = torch.mean(train_dataset.data, dim=0), torch.std(train_dataset.data, dim=0)
train_dataset.data = (train_dataset.data - mean_train) / std_train
valid_dataset.data = (valid_dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
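One approach I am considering (assuming the raw samples live in a single dataset.data tensor, as in my class) is to use the indices that random_split stores on each Subset to select only the training rows. A minimal sketch with a hypothetical stand-in dataset instead of my transportation_dataset:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for transportation_dataset: a dataset whose
# samples live in a single .data tensor of shape (num_samples, num_features).
data = torch.randn(100, 5)
dataset = TensorDataset(data)
dataset.data = data  # mimic the .data attribute used above

train_dataset, valid_dataset = random_split(dataset, [80, 20])

# A Subset keeps the indices it was built from, so the training rows
# can be selected from the underlying tensor without looping:
train_data = dataset.data[train_dataset.indices]
mean_train = torch.mean(train_data, dim=0)
std_train = torch.std(train_data, dim=0)
```

This computes the statistics over the 80 training rows only, so the validation rows would no longer influence the normalization, but I am not sure it is the intended way to work with Subsets.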