Split dataset into validation and training data

Hi All,

So I have a dataset with 411 classes (different persons) and a network that should recognise these subjects based on their iris (an iris recognition system). Before I train, I need to split my images into training and validation sets. I have around 20 images per person, and I want to distribute them evenly so that for each person (each class) 80% is used for training and 20% for validation.

How do I do this?
Thanks in advance

Well, after you have created the Dataset, you can use sklearn.model_selection.train_test_split to split it for you. More details can be found here.
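
For example, here is a minimal toy sketch of a stratified split with this function (the indices and labels below are made up just to show the API; in practice you would pass your own sample indices and per-sample labels):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy example: 10 samples across 2 classes
indices = list(range(10))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # one label per sample

# Stratified 80/20 split: each class keeps the same ratio in both parts
train_idx, val_idx = train_test_split(indices, train_size=0.8, stratify=labels)
print(train_idx, val_idx)  # 8 training indices, 2 validation indices
```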

So I tried using it and wrote the following:

```python
dataset = torchvision.datasets.ImageFolder(DATA_PATH, transform=transform)
strat = np.arange(1, classes)
train_dataset, validate_dataset = train_test_split(dataset, train_size=0.8, stratify=strat)

# Data loaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=False)
validate_loader = torch.utils.data.DataLoader(validate_dataset, batch_size, shuffle=False)
```

But when I try to run it I get the following error:

```
Traceback (most recent call last):
  File "/Users/HannesDeSmet/Documents/Unif/Thesis/Prototype/Recognition/Recognition.py", line 31, in <module>
    train_dataset, validate_dataset = train_test_split(dataset,train_size=0.8,stratify=strat)
  File "/Users/HannesDeSmet/Documents/Unif/Thesis/Prototype/Recognition/venv/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 2141, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/Users/HannesDeSmet/Documents/Unif/Thesis/Prototype/Recognition/venv/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1328, in split
    X, y, groups = indexable(X, y, groups)
  File "/Users/HannesDeSmet/Documents/Unif/Thesis/Prototype/Recognition/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 237, in indexable
    check_consistent_length(*result)
  File "/Users/HannesDeSmet/Documents/Unif/Thesis/Prototype/Recognition/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 212, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [16082, 410]
```

For the record, I have 16082 images with 410 different classes/subjects.
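
The error says the two inputs have different lengths: stratify expects one label per sample (16082 entries), whereas np.arange(1, classes) gives one entry per class (410). A sketch of one way to fix this, assuming the same DATA_PATH, transform, and batch_size as above, and that your torchvision version exposes the per-sample labels of an ImageFolder via dataset.targets:

```python
import numpy as np
import torch
import torchvision
from sklearn.model_selection import train_test_split

dataset = torchvision.datasets.ImageFolder(DATA_PATH, transform=transform)

# One class label per image (length 16082), so stratify can keep
# each person's 80/20 ratio between the two splits
labels = dataset.targets  # or [s[1] for s in dataset.samples] on older torchvision

# Split the sample indices, not the Dataset object itself
train_idx, val_idx = train_test_split(
    np.arange(len(dataset)),
    train_size=0.8,
    stratify=labels,
)

train_dataset = torch.utils.data.Subset(dataset, train_idx)
validate_dataset = torch.utils.data.Subset(dataset, val_idx)

# Data loaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=False)
validate_loader = torch.utils.data.DataLoader(validate_dataset, batch_size, shuffle=False)
```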