A bit late, but for anyone with the same question: if you have a Sized dataset holding all your data (let’s call it full_dataset), you can create separate datasets for training, validation and testing with the following code:
import torch
from torch.utils.data import random_split

def split(full_dataset, val_percent, test_percent, random_seed=None):
    amount = len(full_dataset)
    # Sizes of the test and validation splits; None means "no such split"
    test_amount = (
        int(amount * test_percent)
        if test_percent is not None else 0)
    val_amount = (
        int(amount * val_percent)
        if val_percent is not None else 0)
    # The training split gets whatever remains
    train_amount = amount - test_amount - val_amount
    train_dataset, val_dataset, test_dataset = random_split(
        full_dataset,
        (train_amount, val_amount, test_amount),
        generator=(
            torch.Generator().manual_seed(random_seed)
            # Compare against None so that a seed of 0 is not ignored
            if random_seed is not None
            else None))
    return train_dataset, val_dataset, test_dataset
(The random seed is optional; use it if you want reproducibility across different runs. It is advisable if you train a model on the same dataset while loading the model state across runs, so that the test and validation data do not mix into the training data from one run to the next.)
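A minimal sketch of why the seed matters (this example is mine, not part of the original answer; the 100-element TensorDataset and the 80/10/10 lengths are assumptions for illustration): with a fixed generator seed, random_split assigns the same indices to each split on every call.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# 100 dummy samples standing in for a real dataset
full_dataset = TensorDataset(torch.arange(100))

def seeded_split():
    # 80/10/10 split driven by a generator with a fixed seed
    return random_split(
        full_dataset, (80, 10, 10),
        generator=torch.Generator().manual_seed(42))

first = [subset.indices for subset in seeded_split()]
second = [subset.indices for subset in seeded_split()]
assert first == second  # identical splits on both calls
```

Without the seeded generator, each call would draw a fresh random permutation, and samples seen during training in one run could land in the test split of the next.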
Then you can use it as:
train_dataset, val_dataset, test_dataset = split(full_dataset, 0.1, 0.1, 42)
(If you set the validation and test datasets to 10% of the data each, the train dataset consequently receives the remaining 80%.)
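The three returned datasets are ordinary Subset objects, so a typical next step is to wrap them in DataLoaders. A hedged sketch, where the TensorDataset with 100 random samples and the batch size of 16 are placeholder assumptions for whatever full_dataset you actually have:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder dataset: 100 samples of 3 features with binary labels
full_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# 80/10/10 split, seeded for reproducibility
train_dataset, val_dataset, test_dataset = random_split(
    full_dataset, (80, 10, 10),
    generator=torch.Generator().manual_seed(42))

# Shuffle only the training loader; evaluation order can stay fixed
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)

xb, yb = next(iter(train_loader))
```

Shuffling is usually enabled only for the training loader, since validation and test metrics do not depend on sample order.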