Efficient way of calculating mean, std, etc. of a Subset Dataset

I have a Dataset class that loads two datasets from their respective folders (train and test).
I would like to create a validation set from the training set. For this I am using the random_split function.
This results in two Subset datasets: train_dataset and valid_dataset. For normalization I would like to calculate the mean and std (or min/max) of the training set, but it is not possible to simply call mean() on the dataset, because it is a Subset.
What I can do is calculate the mean, std, etc. on the whole dataset, but then the statistics would also be influenced by the validation samples, so the validation set would effectively be normalized with values computed from itself. Is there an efficient way to calculate the statistics of a Subset, or do I need to loop over it?
This is my code, which calculates the “wrong” mean and std (over the whole dataset):

    import numpy as np
    import torch
    from torch.utils.data import random_split

    dataset = transportation_dataset(data_path=data_folder, train=True)
    # Split the data into training and validation set
    num_train = len(dataset)
    split_valid = int(np.floor(valid_size * num_train))
    split_train = num_train - split_valid
    train_dataset, valid_dataset = random_split(dataset, [split_train, split_valid])
    # Test dataset
    test_dataset = transportation_dataset(data_path=data_folder, train=False)

    # TODO normalize dataset (using scaler trained on training set)
    # get mean and std of trainset (for every feature)
    # Just loop over dataset?
    mean_train, std_train = torch.mean(train_dataset.dataset.data, dim=0), torch.std(train_dataset.dataset.data, dim=0)
    # these are the same values as over the whole dataset below (which should not be the case)
    mean_train, std_train = torch.mean(dataset.data, dim=0), torch.std(dataset.data, dim=0)
    train_dataset.dataset.data = (train_dataset.dataset.data - mean_train) / std_train
    valid_dataset.dataset.data = (valid_dataset.dataset.data - mean_train) / std_train
    test_dataset.data = (test_dataset.data - mean_train) / std_train

I expected to be able to do the following for calculating the mean and std for the Subsets:

    mean_train, std_train = torch.mean(train_dataset.data, dim=0), torch.std(train_dataset.data, dim=0)
    train_dataset.data = (train_dataset.data - mean_train) / std_train
    valid_dataset.data = (valid_dataset.data - mean_train) / std_train
    test_dataset.data = (test_dataset.data - mean_train) / std_train

Since you can directly access the .data inside your dataset, I would recommend creating the dataset indices manually and splitting them, e.g. using sklearn.model_selection.train_test_split.
Once you have the training and validation indices, you can pass the original dataset together with the corresponding indices to a Subset.
The statistics should then be calculated on the training subset by indexing subset.dataset.data with the training indices, as in the sketch below.
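
A minimal sketch of this idea, reusing dataset, valid_size, and the .data tensor from the code above (the random_state value is just an example):

    import torch
    from torch.utils.data import Subset
    from sklearn.model_selection import train_test_split

    # create the dataset indices manually and split them
    indices = list(range(len(dataset)))
    train_indices, valid_indices = train_test_split(indices, test_size=valid_size, random_state=42)

    # wrap the original dataset together with the corresponding indices
    train_dataset = Subset(dataset, train_indices)
    valid_dataset = Subset(dataset, valid_indices)

    # statistics calculated only from the training samples
    mean_train = torch.mean(dataset.data[train_indices], dim=0)
    std_train = torch.std(dataset.data[train_indices], dim=0)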

Let me know if this would work for you.

Thanks for the idea! I came up with something a bit different:

    # statistics computed only from the training samples
    mean_train = torch.mean(train_dataset.dataset.data[train_dataset.indices], dim=0)
    std_train = torch.std(train_dataset.dataset.data[train_dataset.indices], dim=0)
    # normalize every split in place with the training statistics
    train_dataset.dataset.data[train_dataset.indices] = (train_dataset.dataset.data[train_dataset.indices] - mean_train) / std_train
    valid_dataset.dataset.data[valid_dataset.indices] = (valid_dataset.dataset.data[valid_dataset.indices] - mean_train) / std_train
    test_dataset.data = (test_dataset.data - mean_train) / std_train

This way I am still using PyTorch's random_split, but calculate the stats on the train subset only.


Hi, thank you for asking and sharing the answer to this!

I encountered the same issue and tried to use your code. But I wonder if it is possible to get all the data with [train_dataset.indices]. I'm not sure whether it's a list, an array, or a single index. If [train_dataset.indices] is a single index, two for loops are needed at the end. Is this what you do exactly?

train_dataset.indices is a list of indices. So when I call valid_dataset.dataset.data I get all the data from the dataset. In order to get only the samples that belong to the Subset, I can select them by indexing: train_dataset.dataset.data[train_dataset.indices]. My data is a tensor, but it could also be a numpy array.
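
A tiny standalone sketch of that indexing (the TensorDataset and the manually attached .data attribute are just placeholders for your own dataset):

    import torch
    from torch.utils.data import TensorDataset, random_split

    data = torch.randn(10, 3)          # 10 samples, 3 features
    dataset = TensorDataset(data)
    dataset.data = data                # mimic a dataset exposing .data

    train_dataset, valid_dataset = random_split(dataset, [8, 2])
    print(train_dataset.indices)       # a list of 8 indices, e.g. [3, 7, 0, 9, 5, 1, 8, 2]
    train_data = train_dataset.dataset.data[train_dataset.indices]
    print(train_data.shape)            # torch.Size([8, 3])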

You can also loop over it, I think:

    # sum the training samples one by one and divide by their count
    current_mean = 0.0
    for i in train_dataset.indices:
        current_mean += train_dataset.dataset.data[i]
    current_mean /= len(train_dataset.indices)

Hope this helps!

Now I get it. Thank you!!