High-level question: how did you implement dataset splitting and standardisation of training data?

Hi everyone! I have a general question and was hoping for your advice. This is my first question on this forum – I have searched for a while but could not quite find what I am looking for, so apologies if this has been asked before!

I have a dataset, let’s say with 5000 observations (rows) and 10 variables (columns). I want to randomly split my data into training, validation and test sets. Moreover, I want my training dataset to be standardised so that each column has zero mean and a standard deviation of 1. Most importantly, there should be no information leakage from the test dataset, so the mean and std must be calculated from the training set only!

My question is: on a high level, how would you implement this?

I have tried different things, but am not quite happy yet.

First implementation: I split my data into training, validation and test sets inside my Dataset class and return three standardised datasets – training, validation and test. By standardised I mean that all three datasets are standardised using the training mean and training std. I then use a single DataLoader, which feeds batches from these datasets into my training loop; the batches are not standardised any further.
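
In other words, the standardisation part looks roughly like this (a rough sketch with dummy data rather than my actual code):

```python
import torch

# dummy data standing in for my real dataset: 5000 rows, 10 columns
data = torch.randn(5000, 10)

# random split indices: e.g. 70% train, 15% val, 15% test
perm = torch.randperm(data.size(0))
train_idx, val_idx, test_idx = perm[:3500], perm[3500:4250], perm[4250:]

# mean/std computed from the training rows only, so nothing leaks from val/test
mean = data[train_idx].mean(dim=0)
std = data[train_idx].std(dim=0)

# all three splits are standardised with the *training* statistics
train_x = (data[train_idx] - mean) / std
val_x = (data[val_idx] - mean) / std
test_x = (data[test_idx] - mean) / std
```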

Second implementation: I again have a Dataset class, but this time it simply returns x and y. Next, I use SubsetRandomSampler to create three different DataLoaders – one for each split. However, I am unsure how I would standardise my dataset in this scenario: since the DataLoader returns batches, it seems I could only standardise the individual batches inside the training loop rather than the entire training/validation/test splits. Or would that be more appropriate?
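
For illustration, the second setup looks roughly like this (again a sketch with dummy data) – the part I am unsure about is where the standardisation with the training statistics should happen:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# dummy data/targets as placeholders for the real dataset
data = torch.randn(5000, 10)
targets = torch.randint(0, 2, (5000,))
dataset = TensorDataset(data, targets)  # simply returns (x, y), no standardisation

# random split indices
perm = torch.randperm(len(dataset)).tolist()
train_idx, val_idx, test_idx = perm[:3500], perm[3500:4250], perm[4250:]

# one DataLoader per split, each with its own sampler
train_loader = DataLoader(dataset, batch_size=64, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=64, sampler=SubsetRandomSampler(val_idx))
test_loader = DataLoader(dataset, batch_size=64, sampler=SubsetRandomSampler(test_idx))
```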

Is there a more elegant way of doing this? I have very little experience with the batch normalisation layers that exist within PyTorch. Can I standardise my data ‘inside the network’ so that I do not actually need to return standardised data in the first place?

This may be more of a general ML question – I hope this makes sense!

I would recommend splitting the initial dataset either manually or with e.g. sklearn.model_selection.train_test_split.
Once you have the splits, you could create 3 separate Dataset objects, each with its own transformation (where the normalization stats were calculated from the training set), as well as three separate DataLoaders.
This would create a clean split in my opinion, which would be easily readable and might thus avoid potential data leakage.
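
Something along these lines might work (just a sketch using random data, an example 70/15/15 split, and a simple manual normalization as the "transformation"):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# dummy data as a placeholder for the real dataset
X = np.random.randn(5000, 10).astype(np.float32)
y = np.random.randint(0, 2, size=5000)

# split into train / val / test (here 70 / 15 / 15)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# normalization stats from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

class MyDataset(Dataset):
    def __init__(self, X, y, mean, std):
        self.X, self.y = X, y
        self.mean, self.std = mean, std

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # each split applies the *training* statistics
        x = (self.X[idx] - self.mean) / self.std
        return torch.from_numpy(x), torch.tensor(self.y[idx])

train_ds = MyDataset(X_train, y_train, mean, std)
val_ds = MyDataset(X_val, y_val, mean, std)
test_ds = MyDataset(X_test, y_test, mean, std)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)
test_loader = DataLoader(test_ds, batch_size=64)
```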

BatchNorm layers normalize the activations inside the model, but usually you would still apply a normalization to the model inputs, regardless of whether batchnorm is used or not.
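
E.g. a toy setup could look like this, where the inputs are already normalized beforehand and the BatchNorm layer only normalizes the intermediate activations:

```python
import torch
import torch.nn as nn

# toy model: batchnorm normalizes the activations *inside* the model
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

# the inputs would still be normalized beforehand (e.g. with the training stats)
x = torch.randn(64, 10)  # a batch of already-normalized inputs
out = model(x)
```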

Hi ptrblck! Thank you for your quick response – I greatly appreciate it. This is great advice – I hadn’t thought about having three separate Dataset objects yet, and I think this could add to the readability of my code.

Have a great day!