A bit late, but for anyone with the same question: if you have a Sized dataset holding all your data (let’s call it full_dataset), you can create separate datasets for training, validation and testing with the following code:
import torch
from torch.utils.data import random_split

def split(full_dataset, val_percent, test_percent, random_seed=None):
    amount = len(full_dataset)
    # Sizes of the test and validation splits; None means "no such split"
    test_amount = (
        int(amount * test_percent)
        if test_percent is not None else 0)
    val_amount = (
        int(amount * val_percent)
        if val_percent is not None else 0)
    # The training split gets whatever remains
    train_amount = amount - test_amount - val_amount
    train_dataset, val_dataset, test_dataset = random_split(
        full_dataset,
        (train_amount, val_amount, test_amount),
        generator=(
            torch.Generator().manual_seed(random_seed)
            # Compare against None so that a seed of 0 is not ignored
            if random_seed is not None
            else None))
    return train_dataset, val_dataset, test_dataset
(The random seed is optional; use it if you want reproducibility across different runs. It is advisable if you train a model on the same dataset while loading the model state across runs, so that the test and validation data do not mix into the training data from one run to the next.)
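A minimal sketch of why the seed matters (this example is mine, not part of the original answer; the 100-element TensorDataset and the 80/10/10 lengths are assumptions for illustration): with a fixed generator seed, random_split assigns the same indices to each split on every call.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# 100 dummy samples standing in for a real dataset
full_dataset = TensorDataset(torch.arange(100))

def seeded_split():
    # 80/10/10 split driven by a generator with a fixed seed
    return random_split(
        full_dataset, (80, 10, 10),
        generator=torch.Generator().manual_seed(42))

first = [subset.indices for subset in seeded_split()]
second = [subset.indices for subset in seeded_split()]
assert first == second  # identical splits on both calls
```

Without the seeded generator, each call would draw a fresh random permutation, and samples seen during training in one run could land in the test split of the next.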
Then you can use it as:
train_dataset, val_dataset, test_dataset = split(full_dataset, 0.1, 0.1, 42)
(If you set the validation and test datasets to 10% of the data each, the train dataset consequently receives the remaining 80%.)
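The three returned datasets are ordinary Subset objects, so a typical next step is to wrap them in DataLoaders. A hedged sketch, where the TensorDataset with 100 random samples and the batch size of 16 are placeholder assumptions for whatever full_dataset you actually have:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder dataset: 100 samples of 3 features with binary labels
full_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# 80/10/10 split, seeded for reproducibility
train_dataset, val_dataset, test_dataset = random_split(
    full_dataset, (80, 10, 10),
    generator=torch.Generator().manual_seed(42))

# Shuffle only the training loader; evaluation order can stay fixed
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)

xb, yb = next(iter(train_loader))
```

Shuffling is usually enabled only for the training loader, since validation and test metrics do not depend on sample order.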