How to split dataset into test and validation sets

I have a dataset in which the images are sorted into different folders by class.
I want to split the data into train, validation, and test sets.

Please help.


You can use the built-in function torch.utils.data.random_split(dataset, lengths).

Check docs here: https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split
Check source here: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#random_split

Also, you can see this nice example.
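For instance, here is a minimal sketch of a three-way split with random_split on an ImageFolder dataset; the path 'path/to/images' and the split fractions are placeholders, not values from this thread:

import torch
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder('path/to/images', transform=transforms.ToTensor())

# random_split requires the lengths to sum to len(dataset).
n_val = int(0.15 * len(dataset))
n_test = int(0.15 * len(dataset))
n_train = len(dataset) - n_val - n_test

# On newer PyTorch versions you can also pass a generator for a reproducible split.
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))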

import torch
from torchvision import transforms
from torchvision.datasets import MNIST

# Normalize MNIST and split the 60000 training samples into 50000/10000.
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
dataset = MNIST(root='./data', train=True, transform=transform, download=True)
train_set, val_set = torch.utils.data.random_split(dataset, [50000, 10000])

random_split does not seem to split the dataset: the sizes of the returned train_set and val_set are both 60000, which is equal to the initial dataset size.

A similar issue has also been reported on Stack Overflow:

Please look into the issue.

Thanks.


Double post, answered here.

You can use the following code to create the train/val split. You can specify the val_split float value (between 0.0 and 1.0) in the train_val_dataset function. You can also modify the function to create a train/val/test split by partitioning the indices of list(range(len(dataset))) into three subsets (a sketch of such a three-way variant follows the example output below). Just remember to shuffle the list before splitting, otherwise you won't get all the classes in the three splits, since these indices are used by the Subset class to sample from the original dataset.

import torch
from torchvision.datasets import ImageFolder
from torch.utils.data import Subset, DataLoader
from sklearn.model_selection import train_test_split
from torchvision.transforms import Compose, ToTensor, Resize

def train_val_dataset(dataset, val_split=0.25):
    # Split the dataset indices; train_test_split shuffles them by default.
    train_idx, val_idx = train_test_split(list(range(len(dataset))), test_size=val_split)
    datasets = {}
    datasets['train'] = Subset(dataset, train_idx)
    datasets['val'] = Subset(dataset, val_idx)
    return datasets

dataset = ImageFolder(r'C:\Datasets\lcms-dataset', transform=Compose([Resize((224, 224)), ToTensor()]))
print(len(dataset))
datasets = train_val_dataset(dataset)
print(len(datasets['train']))
print(len(datasets['val']))
# The original dataset is available in the Subset class
print(datasets['train'].dataset)

dataloaders = {x:DataLoader(datasets[x],32, shuffle=True, num_workers=4) for x in ['train','val']}
x,y = next(iter(dataloaders['train']))
print(x.shape, y.shape)
50080
37560
12520
Dataset ImageFolder
    Number of datapoints: 50080
    Root location: C:\Datasets\lcms-dataset
    StandardTransform
Transform: Compose(
               Resize(size=(224, 224), interpolation=PIL.Image.BILINEAR)
               ToTensor()
           )
torch.Size([32, 3, 224, 224]) torch.Size([32])
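For completeness, here is one way to extend the function to a train/val/test split; this is an untested sketch and the split fractions are illustrative. train_test_split shuffles the indices by default, so each class should end up in all three subsets:

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

def train_val_test_dataset(dataset, val_split=0.15, test_split=0.15):
    # Carve off the test indices first, then split the remainder into train/val.
    idx = list(range(len(dataset)))
    trainval_idx, test_idx = train_test_split(idx, test_size=test_split)
    # Rescale val_split so that it is a fraction of the remaining indices.
    train_idx, val_idx = train_test_split(
        trainval_idx, test_size=val_split / (1 - test_split))
    return {'train': Subset(dataset, train_idx),
            'val': Subset(dataset, val_idx),
            'test': Subset(dataset, test_idx)}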


Could you explain what the '32' in
dataloaders = {x:DataLoader(datasets[x],32, shuffle=True, num_workers=4) for x in ['train','val']} means?

I believe it would be the batch_size for the DataLoader.


Yes, as Amin_Jun pointed out, it is the batch size of the dataloaders.
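If it helps readability, the same loaders can be built with the batch size named explicitly; this is equivalent to the positional 32 above:

from torch.utils.data import DataLoader

dataloaders = {x: DataLoader(datasets[x], batch_size=32, shuffle=True, num_workers=4)
               for x in ['train', 'val']}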

Thanks a lot! 🙂

If we add augmentation to the images using transform, will this augment the validation set too?
If yes, how can I avoid that?

You are usually creating separate training and validation Datasets and can thus pass the desired transformations to them. The ImageNet example shows this usage.
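One common pattern is sketched below (the data path is a placeholder, and the flip augmentation is just an example): create two ImageFolder instances over the same root, one with augmentation and one without, split the indices once, and apply them to the matching instance so each sample lands in exactly one split:

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset
from torchvision.datasets import ImageFolder
from torchvision.transforms import Compose, RandomHorizontalFlip, Resize, ToTensor

# Augmentation only in the training transform; the validation transform is deterministic.
train_tf = Compose([Resize((224, 224)), RandomHorizontalFlip(), ToTensor()])
val_tf = Compose([Resize((224, 224)), ToTensor()])

train_data = ImageFolder('path/to/images', transform=train_tf)
val_data = ImageFolder('path/to/images', transform=val_tf)

# Split the indices once and reuse them, so the two subsets do not overlap.
train_idx, val_idx = train_test_split(list(range(len(train_data))), test_size=0.25)
datasets = {'train': Subset(train_data, train_idx),
            'val': Subset(val_data, val_idx)}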
