Hi everyone! I have a general question and was hoping for your advice. This is my first question on this forum – I searched for a while but could not quite find what I was looking for, so if this has been asked before, I am sorry!
I have a dataset, let’s say with 5000 observations (rows) and 10 variables (columns). I want to randomly split it into training, validation and testing sets. Moreover, I want my training data to be standardised so that each column has zero mean and a standard deviation of 1. Most importantly, there should be no information leakage from the test set: the mean and std must be calculated from the training set only!
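To make the leakage constraint concrete, this is roughly what I mean (the 70/15/15 split sizes here are arbitrary placeholders):

```python
import torch

# Fake dataset: 5000 observations, 10 variables
X = torch.randn(5000, 10)

# Random 70/15/15 split of the row indices
perm = torch.randperm(len(X))
n_train, n_val = 3500, 750
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

# Statistics come from the training rows ONLY -- no leakage
mean = X[train_idx].mean(dim=0)
std = X[train_idx].std(dim=0)

# ...but are applied to all three splits
X_train = (X[train_idx] - mean) / std
X_val = (X[val_idx] - mean) / std
X_test = (X[test_idx] - mean) / std
```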
My question is: on a high level, how would you implement this?
I have tried different things, but am not quite happy yet.
First implementation: I split my data into training, validation and testing inside my Dataset class and return three standardised datasets. By standardised I mean that all three datasets are standardised using the training mean and training std. I subsequently use one DataLoader, which feeds batches from these datasets into my training loop; the batches are not standardised any further.
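A rough sketch of this first implementation (the function name, split sizes and `TensorDataset` wrapping are just how I would put it, not necessarily the best way):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_standardised_splits(X, y, n_train, n_val, seed=0):
    """Split the data, then standardise every split with *training* statistics."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(X), generator=g)
    tr = perm[:n_train]
    va = perm[n_train:n_train + n_val]
    te = perm[n_train + n_val:]

    # Training statistics only -- no leakage from val/test
    mean = X[tr].mean(dim=0)
    std = X[tr].std(dim=0)

    def std_ds(idx):
        return TensorDataset((X[idx] - mean) / std, y[idx])

    return std_ds(tr), std_ds(va), std_ds(te)

X, y = torch.randn(5000, 10), torch.randn(5000)
train_ds, val_ds, test_ds = make_standardised_splits(X, y, 3500, 750)

# One DataLoader per split; batches come out already standardised
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
```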
Second implementation: I again have a Dataset class, but this time it simply returns x and y. Next, I use SubsetRandomSampler to build three different DataLoaders – one per split. However, I am unsure how I would standardise my data in this scenario: since the DataLoader returns batches, it seems I could only standardise individual batches inside the training loop rather than the entire training set. Or would that actually be more appropriate?
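Roughly what I mean by the second implementation. Since SubsetRandomSampler only controls which indices each loader draws, in this sketch I standardise the underlying tensor once, up front, using the training indices (class and variable names are just placeholders):

```python
import torch
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

class MyDataset(Dataset):
    """Plain dataset that simply returns (x, y) pairs."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

X, y = torch.randn(5000, 10), torch.randn(5000)
perm = torch.randperm(5000)
train_idx, val_idx, test_idx = perm[:3500], perm[3500:4250], perm[4250:]

# Standardise the stored tensor once with training stats,
# before any DataLoader sees it
mean, std = X[train_idx].mean(dim=0), X[train_idx].std(dim=0)
ds = MyDataset((X - mean) / std, y)

# Three DataLoaders over the same dataset, one sampler per split
loaders = {
    name: DataLoader(ds, batch_size=64, sampler=SubsetRandomSampler(idx.tolist()))
    for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]
}
```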
Is there a more elegant way of doing this? I have very little experience with the batch normalisation layers that exist within PyTorch. Can I standardise my data ‘inside the network’, so that I do not need to return standardised data in the first place?
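As far as I understand, BatchNorm normalises with per-batch (and running) statistics, which is not quite the same as fixed training-set standardisation. What I have in mind by ‘inside the network’ is something like a small first layer holding the training stats as buffers – a sketch of my idea, not an existing PyTorch layer:

```python
import torch
import torch.nn as nn

class Standardise(nn.Module):
    """First layer that applies fixed training-set statistics."""
    def __init__(self, mean, std):
        super().__init__()
        # Buffers move with .to(device) and are saved in the state_dict,
        # but are not updated by the optimiser
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)

    def forward(self, x):
        return (x - self.mean) / self.std

# Training stats baked into the model itself
X_train = torch.randn(3500, 10)
model = nn.Sequential(
    Standardise(X_train.mean(dim=0), X_train.std(dim=0)),
    nn.Linear(10, 1),
)
```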
This may be more of a general ML question – I hope this makes sense!