High-level question: how did you implement dataset splitting and standardisation of training data?

Hi Everyone! I have a general question and was hoping for your advice. This is my first question in this forum – I have searched the forum for a while but could not quite find what I am looking for. So if this has been asked before, I am sorry!

I have a dataset, let’s say with 5000 observations (rows) and 10 variables (columns). I want to randomly split my data into training, validation and testing sets. Moreover, I want my training dataset to be standardised so that each column has zero mean and a standard deviation of 1. Most importantly, there should be no information leakage from the test dataset, so the mean and std must be calculated from the training set only!

My question is: on a high level, how would you implement this?

I have tried different things, but am not quite happy yet.

First implementation: I split my data into training, validation and testing inside my Dataset class and return the standardised datasets – training, validation and testing. By standardised I mean that I standardise all three datasets using the training mean and training std. I then use one DataLoader, which feeds batches from the training/validation/test datasets into my training loop. The batches are not standardised any further.

Second implementation: I again have a Dataset class, but this time it simply returns x and y. In a next step I use SubsetRandomSampler to create three different DataLoaders – one for each split. However, I am unsure how I would standardise my dataset in this scenario: since the DataLoader returns batches, it seems I could only standardise the individual batches in the training loop rather than the entire training/validation/test datasets. Or would that be more appropriate?

Is there a more elegant way of doing this? I have very little experience with the batch normalisation layers that exist within PyTorch. Can I standardise my data ‘inside the network’ so that I do not actually need to return standardised data in the first place?

This may be more of a general ML question – I hope this makes sense!

I would recommend splitting the initial dataset either manually or with e.g. sklearn.model_selection.train_test_split.
Once you have the splits, you could create three separate Dataset objects, each with its own transformation (where the normalization stats are calculated from the training set only), as well as three separate DataLoaders.
In my opinion this creates a clean split, which is easily readable and thus helps avoid potential data leakage.
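
A minimal sketch of that pattern could look like this (X and y are placeholder NumPy arrays, and the dataset class, split ratios and batch size are only illustrative):

    import torch
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import DataLoader, Dataset


    class TabularDataset(Dataset):
        def __init__(self, X, y):
            self.X = torch.as_tensor(X, dtype=torch.float32)
            self.y = torch.as_tensor(y, dtype=torch.long)

        def __len__(self):
            return len(self.y)

        def __getitem__(self, idx):
            return self.X[idx], self.y[idx]


    # split once into train / val / test (60 / 20 / 20 here)
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # fit the scaler on the training split only and apply it to all three splits
    scaler = StandardScaler().fit(X_train)
    train_ds = TabularDataset(scaler.transform(X_train), y_train)
    val_ds = TabularDataset(scaler.transform(X_val), y_val)
    test_ds = TabularDataset(scaler.transform(X_test), y_test)

    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=64)
    test_loader = DataLoader(test_ds, batch_size=64)

Since the scaler is fit on X_train before it ever touches the validation and test splits, no test statistics can leak into training.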

BatchNorm layers normalize the activations inside the model, but usually you would still apply a normalization to the model inputs, regardless of whether batchnorm is used or not.
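
For illustration only (arbitrary layer sizes, assuming the 10-feature setup from the question), a BatchNorm layer would sit inside the model like this, while the input standardisation still happens in the data pipeline:

    import torch.nn as nn

    # the inputs fed to this model are still standardised with the training-set
    # statistics; BatchNorm1d only normalizes the hidden activations per batch
    model = nn.Sequential(
        nn.Linear(10, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Linear(64, 2),
    )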

Hi ptrblck! Thank you for your quick response – I greatly appreciate it. This is great advice, as I hadn’t thought about having three separate Dataset objects yet; I think this could add to the readability of my code.

Have a great day!

Hi ptrblck,

I tried building on your suggestions:

  • Split the data
  • Make a Dataset for each split and do the transformation (I used sklearn’s StandardScaler)
  • Make separate DataLoaders

and came up with this:

    import torch
    from torch.utils.data import Dataset


    class InsData(Dataset):
        def __init__(self, X, y, X_transform=None):
            self.X = X
            self.y = y
            self.X_transform = X_transform

        def __len__(self):
            return len(self.y)

        def __getitem__(self, idx):

            if self.X_transform is not None:
                # .iloc[idx, :] returns the row as a Series, so to_frame().T turns it
                # back into a [1, C] DataFrame that keeps the column names
                X_data = self.X.iloc[idx, :].to_frame().T
                # flatten the scaled [1, C] array to [C]
                X_data = self.X_transform.transform(X_data).reshape(-1)
                X_data = torch.FloatTensor(X_data)
                y_data = torch.LongTensor(self.y.iloc[idx, :].values)
                # y_data has shape [1], so batches are [B, 1] and get flattened
                # in the training loop
                return X_data, y_data

            X_data = torch.FloatTensor(self.X.iloc[idx, :].values)
            y_data = torch.LongTensor(self.y.iloc[idx, :].values)

            return X_data, y_data

    scaler = StandardScaler()
    scaler.fit(Xtrain)

    trainset = InsData(Xtrain, ytrain, scaler)
    testset = InsData(Xtest, ytest, scaler)

This works, but using the transformation slows down training a lot. I think this is because the Dataset applies the transformation to each row individually. Am I structuring this incorrectly, or does this structure just not make sense for tabular data?

I would also like to add feature engineering to this dataset. As I see it, I have these options:

  • Do the data transformation outside the Dataset
  • Do the data transformation on the whole of X inside the Dataset, possibly in __init__

Would you have any suggestions?

I recommend a tool such as line profiler to debug any performance bottlenecks.

Since __getitem__ does not need X_transform as a parameter, I would suggest moving the transformation to __init__ so that it is only executed once.

It depends on the applied transformation.
I.e. you would need to check whether any randomness is needed, or whether you would “leak” information into the different splits if the transformation is applied to the entire dataset before the splitting.

E.g. the StandardScaler should use the stats of the training dataset only and should then be applied to the validation and test datasets. Applying it to all samples would be considered a data leak.
Any random transformation should be applied on the fly in __getitem__, since you would otherwise just transform all samples once (and keep them static) for the entire training run.
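
If it helps, here is a rough sketch of that suggestion applied to the InsData class from above (same imports as before; the fitted scaler is still created from Xtrain only and passed to every split). The deterministic scaling is done once in __init__, so __getitem__ only indexes into precomputed tensors:

    class InsData(Dataset):
        def __init__(self, X, y, X_transform=None):
            if X_transform is not None:
                # apply the fitted scaler once to the whole split ...
                X = X_transform.transform(X)
            else:
                X = X.values
            # ... and store tensors, so __getitem__ only has to index
            self.X = torch.FloatTensor(X)
            # y stays [N, 1] as before and is flattened in the training loop
            self.y = torch.LongTensor(y.values)

        def __len__(self):
            return len(self.y)

        def __getitem__(self, idx):
            return self.X[idx], self.y[idx]

The existing scaler.fit(Xtrain) call and the InsData(Xtrain, ytrain, scaler) / InsData(Xtest, ytest, scaler) constructions would stay unchanged.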

Hi ptrblck,

Thank you for your in-depth responses on the forums! I’m currently working on a sequential image classifier using a CNN-Bi-LSTM architecture. Fixed-length sequences (length ~5) of images are fed into my network, which outputs a single label for classification.

My current approach to splitting the data has been creating a Dataset that reads in the image paths and class IDs, and processes the index from __getitem__() to generate sequences in a sliding window fashion. This works perfectly fine, but is it more standard to have this sequence partitioning logic implemented in a Sampler (similar to your reply here)?

I’m currently using random_split for my train, validation, and test splits, and have since realized there is data leakage due to the sliding window approach. How do you recommend fixing this while keeping the model robust (without breaking the sequence partitioning logic)? If I don’t allocate my testing data in a large chunk (i.e. random sampling), I’ll lose a good amount of potential sequences.

Here is my Dataset class:

    import os

    import pandas as pd
    import torch
    from torch.utils.data import Dataset
    from torchvision.io import read_image


    class MyDataset(Dataset):
        def __init__(self, img_dir, annotations_file, seq_length, transform=None, target_transform=None):
            self.img_dir = img_dir
            self.img_labels = pd.read_csv(annotations_file, header=None, names=['image', 'class'])
            self.seq_length = seq_length
            self.transform = transform
            self.target_transform = target_transform
            self.class_groups = self.img_labels.groupby('class')

        def __len__(self):
            # one sample per sliding-window position within each class group
            return sum(len(group) - self.seq_length + 1 for _, group in self.class_groups)

        def __getitem__(self, idx):
            # Find class index and index within the class
            for class_label, group in self.class_groups:
                group_size = len(group) - self.seq_length + 1
                if idx < group_size:
                    group_idx = idx
                    break
                else:
                    idx -= group_size

            # Read sequence images
            img_paths = group.iloc[group_idx : group_idx + self.seq_length, 0].tolist()
            images = []
            for img_path in img_paths:
                image = read_image(os.path.join(self.img_dir, img_path)).float()
                if self.transform:
                    image = self.transform(image)
                images.append(image)

            images = torch.stack(images)
            label = torch.tensor(class_label)

            if self.target_transform:
                label = self.target_transform(label)

            return images, label

Also on another note, will the LSTM learn effectively if I am training on sequences from different classes in my batches?

Really appreciate your help!

Sequential datasets are often split along the temporal dimension, e.g. per day or in multiples of days, to make sure no data is leaked when sequences are used.
Would this work in your use case? E.g. could you split the dataset into temporal sequences and use the first one for training, the next for validation, and the last temporal sequence for testing?
If so, you could create the corresponding indices (leaving a sliding-window offset at the end of each split) and pass these to Subset to create the data splits.
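
For what it’s worth, a rough sketch of such an index construction, assuming a single temporally ordered group of frames (num_frames, the 70/15/15 fractions, and the dataset variable are placeholders; with several class groups you would build these ranges per group and shift them by each group’s offset in the flattened index space):

    from torch.utils.data import Subset


    def split_window_indices(num_frames, seq_length, train_frac=0.7, val_frac=0.15):
        # three contiguous temporal blocks; each block only yields sliding windows
        # that lie entirely inside it, so no window crosses a split boundary
        train_end = int(num_frames * train_frac)
        val_end = int(num_frames * (train_frac + val_frac))

        train_idx = list(range(0, train_end - seq_length + 1))
        val_idx = list(range(train_end, val_end - seq_length + 1))
        test_idx = list(range(val_end, num_frames - seq_length + 1))
        return train_idx, val_idx, test_idx


    seq_length = 5
    num_frames = 1000  # placeholder: total frames in the temporally ordered group
    train_idx, val_idx, test_idx = split_window_indices(num_frames, seq_length)

    # `dataset` is the sliding-window Dataset from above, where index i corresponds
    # to the window starting at frame i
    train_set = Subset(dataset, train_idx)
    val_set = Subset(dataset, val_idx)
    test_set = Subset(dataset, test_idx)

Windows that would straddle a boundary are simply dropped, which is the small price for keeping the splits leak-free.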
