How to split dataset into test and validation sets

I have a dataset in which the images are sorted into different folders by class.
I want to split the data into train, validation, and test sets.

Please help.


You can use the built-in function torch.utils.data.random_split(dataset, lengths).

Check docs here: https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split
Check source here: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#random_split

Also, you can see this nice example.
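For instance, here is a minimal sketch of a three-way split with random_split on an ImageFolder dataset; the path 'path/to/images' and the split fractions are placeholders, not values from this thread:

import torch
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder('path/to/images', transform=transforms.ToTensor())

# random_split requires the lengths to sum to len(dataset).
n_val = int(0.15 * len(dataset))
n_test = int(0.15 * len(dataset))
n_train = len(dataset) - n_val - n_test

# On newer PyTorch versions you can also pass a generator for a reproducible split.
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))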

import torch
from torchvision import transforms
from torchvision.datasets import MNIST

# Normalize MNIST and split the 60000 training samples into 50000/10000.
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
dataset = MNIST(root='./data', train=True, transform=transform, download=True)
train_set, val_set = torch.utils.data.random_split(dataset, [50000, 10000])

random_split does not seem to split the dataset: the sizes of the returned train_set and val_set are both 60000, which is equal to the initial dataset size.

A similar issue has also been reported on Stack Overflow:

Please look into the issue.

Thanks.


Double post, answered here.

You can use the following code to create the train/val split. You can specify the val_split float value (between 0.0 and 1.0) in the train_val_dataset function. You can also modify the function to create a train/val/test split by partitioning the indices of list(range(len(dataset))) into three subsets (a sketch of such a three-way variant follows the example output below). Just remember to shuffle the list before splitting, otherwise you won't get all the classes in the three splits, since these indices are used by the Subset class to sample from the original dataset.

import torch
from torchvision.datasets import ImageFolder
from torch.utils.data import Subset, DataLoader
from sklearn.model_selection import train_test_split
from torchvision.transforms import Compose, ToTensor, Resize

def train_val_dataset(dataset, val_split=0.25):
    # Split the dataset indices; train_test_split shuffles them by default.
    train_idx, val_idx = train_test_split(list(range(len(dataset))), test_size=val_split)
    datasets = {}
    datasets['train'] = Subset(dataset, train_idx)
    datasets['val'] = Subset(dataset, val_idx)
    return datasets

dataset = ImageFolder(r'C:\Datasets\lcms-dataset', transform=Compose([Resize((224, 224)), ToTensor()]))
print(len(dataset))
datasets = train_val_dataset(dataset)
print(len(datasets['train']))
print(len(datasets['val']))
# The original dataset is available in the Subset class
print(datasets['train'].dataset)

dataloaders = {x:DataLoader(datasets[x],32, shuffle=True, num_workers=4) for x in ['train','val']}
x,y = next(iter(dataloaders['train']))
print(x.shape, y.shape)
50080
37560
12520
Dataset ImageFolder
    Number of datapoints: 50080
    Root location: C:\Datasets\lcms-dataset
    StandardTransform
Transform: Compose(
               Resize(size=(224, 224), interpolation=PIL.Image.BILINEAR)
               ToTensor()
           )
torch.Size([32, 3, 224, 224]) torch.Size([32])
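For completeness, here is one way to extend the function to a train/val/test split; this is an untested sketch and the split fractions are illustrative. train_test_split shuffles the indices by default, so each class should end up in all three subsets:

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

def train_val_test_dataset(dataset, val_split=0.15, test_split=0.15):
    # Carve off the test indices first, then split the remainder into train/val.
    idx = list(range(len(dataset)))
    trainval_idx, test_idx = train_test_split(idx, test_size=test_split)
    # Rescale val_split so that it is a fraction of the remaining indices.
    train_idx, val_idx = train_test_split(
        trainval_idx, test_size=val_split / (1 - test_split))
    return {'train': Subset(dataset, train_idx),
            'val': Subset(dataset, val_idx),
            'test': Subset(dataset, test_idx)}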


Could you explain what the '32' in
dataloaders = {x:DataLoader(datasets[x],32, shuffle=True, num_workers=4) for x in ['train','val']} means?

I believe it would be the batch_size for the DataLoader.


Yes, as Amin_Jun pointed out, it is the batch size of the dataloaders.
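If it helps readability, the same loaders can be built with the batch size named explicitly; this is equivalent to the positional 32 above:

from torch.utils.data import DataLoader

dataloaders = {x: DataLoader(datasets[x], batch_size=32, shuffle=True, num_workers=4)
               for x in ['train', 'val']}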

Thanks a lot! 🙂

If we add augmentation to the images using transform, will this augment the validation set too?
If yes, how can I avoid that?

You are usually creating separate training and validation Datasets and can thus pass the desired transformations to them. The ImageNet example shows this usage.
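One common pattern is sketched below (the data path is a placeholder, and the flip augmentation is just an example): create two ImageFolder instances over the same root, one with augmentation and one without, split the indices once, and apply them to the matching instance so each sample lands in exactly one split:

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset
from torchvision.datasets import ImageFolder
from torchvision.transforms import Compose, RandomHorizontalFlip, Resize, ToTensor

# Augmentation only in the training transform; the validation transform is deterministic.
train_tf = Compose([Resize((224, 224)), RandomHorizontalFlip(), ToTensor()])
val_tf = Compose([Resize((224, 224)), ToTensor()])

train_data = ImageFolder('path/to/images', transform=train_tf)
val_data = ImageFolder('path/to/images', transform=val_tf)

# Split the indices once and reuse them, so the two subsets do not overlap.
train_idx, val_idx = train_test_split(list(range(len(train_data))), test_size=0.25)
datasets = {'train': Subset(train_data, train_idx),
            'val': Subset(val_data, val_idx)}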
