# Efficient way of calculating mean, std, etc of Subset Dataset

I have a Dataset class that loads two datasets from their respective folders (train and test).
I would like to create a validation set from the training set. For this I am using the `random_split` function.
This results in two Subset-Datasets: `train_dataset` and `valid_dataset`. For normalization I would like to calculate the mean and std (or min/max) of the training set, but it is not possible to do a simple call of `mean()` to the dataset, because it is a Subset.
What I can do is calculate the mean, std, etc. on the whole `dataset`, but then the statistics would also include the validation samples, so information from the validation set would leak into the normalization. Is there an efficient way to calculate statistics of a Subset? Or do I need to loop over it?
This is my code that calculates the "wrong" mean and std (over the whole dataset):

``````
dataset = transportation_dataset(data_path=data_folder, train=True)
# Split the data into training and validation set
num_train = len(dataset)
split_valid = int(np.floor(valid_size * num_train))
split_train = num_train - split_valid
train_dataset, valid_dataset = random_split(dataset, [split_train, split_valid])
# Test dataset
test_dataset = transportation_dataset(data_path=data_folder, train=False)

# TODO normalize dataset (using scaler trained on training set)
# get mean and std of trainset (for every feature)
# Just loop over dataset?
mean_train, std_train = torch.mean(train_dataset.dataset.data, dim=0), torch.std(train_dataset.dataset.data, dim=0)
# this yields the same mean and std as the line above (which should not be the case)
mean_train, std_train = torch.mean(dataset.data, dim=0), torch.std(dataset.data, dim=0)
train_dataset.dataset.data = (train_dataset.dataset.data - mean_train) / std_train
valid_dataset.dataset.data = (valid_dataset.dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
``````
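The reason both calls give the same result is that `Subset.dataset` is simply a reference back to the full underlying dataset, so any tensor reached through it ignores the split entirely. A small check with a toy `TensorDataset` (the names here are illustrative, not the original `transportation_dataset`):

``````python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.randn(10, 3))
train_sub, valid_sub = random_split(full, [8, 2])

# Subset.dataset points back at the *whole* dataset...
assert train_sub.dataset is full
# ...so tensors reached through it still contain all 10 samples
assert train_sub.dataset.tensors[0].shape[0] == 10
# only len() and indexing respect the split
assert len(train_sub) == 8
``````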

I expected to be able to do the following for calculating the mean and std for the Subsets:

``````
mean_train, std_train = torch.mean(train_dataset.data, dim=0), torch.std(train_dataset.data, dim=0)
train_dataset.data = (train_dataset.data - mean_train) / std_train
valid_dataset.data = (valid_dataset.data - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
``````

Since you can directly access the `.data` inside your dataset, I would recommend creating the dataset indices manually and just splitting these indices, e.g. using `sklearn.model_selection.train_test_split`.
Once you have the training and validation indices, you could pass the original dataset together with the corresponding indices to a `Subset`.
The statistics should then be calculated using the training subset and by indexing the `subset.dataset.data` with the training indices.
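A minimal sketch of that approach, using a toy dataset with a `.data` tensor (the `ToyDataset` class and sizes are illustrative assumptions, not the original `transportation_dataset`):

``````python
import torch
from torch.utils.data import Dataset, Subset
from sklearn.model_selection import train_test_split

class ToyDataset(Dataset):
    """Stand-in for a dataset exposing a .data tensor of shape [N, num_features]."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

dataset = ToyDataset(torch.randn(100, 5))

# Split the indices instead of the dataset itself
train_idx, valid_idx = train_test_split(list(range(len(dataset))), test_size=0.2)

# Wrap the original dataset with the corresponding indices
train_dataset = Subset(dataset, train_idx)
valid_dataset = Subset(dataset, valid_idx)

# Statistics computed only from the training samples
mean_train = dataset.data[train_idx].mean(dim=0)
std_train = dataset.data[train_idx].std(dim=0)
``````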

Let me know, if this would work for you.

Thanks for the idea! I came up with something a bit different:

``````
mean_train = torch.mean(train_dataset.dataset.data[train_dataset.indices], dim=0)
std_train = torch.std(train_dataset.dataset.data[train_dataset.indices], dim=0)
train_dataset.dataset.data[train_dataset.indices] = (train_dataset.dataset.data[train_dataset.indices] - mean_train) / std_train
valid_dataset.dataset.data[valid_dataset.indices] = (valid_dataset.dataset.data[valid_dataset.indices] - mean_train) / std_train
test_dataset.data = (test_dataset.data - mean_train) / std_train
``````

This way I am using PyTorch's `random_split`, but calculate the stats on the train subset.


I encountered the same issue and tried to use your code, but I wonder if it is possible to get all the data with `[train_dataset.indices]`. I'm not sure whether it is a list, an array, or a single index. If `[train_dataset.indices]` is a single index, two for loops would be needed in the end. Is this what you do exactly?

`train_dataset.indices` is a list of indices. So when I call `valid_dataset.dataset.data` I get the whole data of the dataset. To get only the samples that belong to the Subset, I can select them with indexing: `train_dataset.dataset.data[train_dataset.indices]`. My data is a tensor, but it could also be a numpy array.
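A small self-contained demonstration of this indexing pattern (the tensor and index values are made up for illustration):

``````python
import torch

# Toy "dataset.data" tensor of shape [6, 2]
data = torch.arange(12, dtype=torch.float32).reshape(6, 2)
indices = [0, 2, 5]  # e.g. what a Subset stores in .indices

# Indexing with a list of indices picks only those rows at once
selected = data[indices]
print(selected.shape)  # torch.Size([3, 2])

# Statistics over the selected rows only
mean = selected.mean(dim=0)
``````

So no loop over single indices is needed; the whole list can be used in one indexing operation.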

You can also loop over it I think:

``````
current_mean = 0.0
for i in train_dataset.indices:
    current_mean += train_dataset.dataset.data[i]
current_mean /= len(train_dataset.indices)
``````

Hope this helps!

Now I get it. Thank you !!