Kfold Pytorch Custom Dataset

Jam · April 22, 2021, 9:04am

I implemented a custom dataset class in PyTorch.
Currently I am reading the data like this:

import glob
import os
import random

from torch.utils.data import DataLoader

# paths
paths = [x for x in glob.glob("polyp_db_sequences/*/*") if os.path.isdir(x)]  # all directories that contain images

# splits
train_paths = random.sample(paths, int(len(paths) * 0.8)) #input for train dataset
test_paths = [x for x in paths if x not in train_paths] #input for test dataset

train_set = MyDataset(train_paths)
train_loader = DataLoader(train_set, batch_size = 16)

However, instead of the 80/20 split above, I’d like to use KFold to split the data into train and test sets, but since I only have the paths, I don’t know how to do that.

ElPolloDiablo · April 22, 2021, 10:09am

Hi @Jam ,

I think the following should do the trick for you:

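from sklearn.model_selection import KFold
import numpy as np
import os

# myrootdirectory is the path to the folder holding your data (set it yourself)
# wrap the listing in a NumPy array so it can be indexed with the index arrays below
myData = np.array(sorted(os.listdir(myrootdirectory)))
kf = KFold(n_splits=5, shuffle=True)

# take the first of the 5 folds; loop over kf.split(myData) to use all of them
train_index, test_index = next(kf.split(myData))
train_data, test_data = myData[train_index], myData[test_index]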

If you use pandas dataframes, you can then easily pass the train/test dataframe object to your custom Dataset MyDataset. Dataframes make sampling and shuffling much easier in the training cohort.
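If you would rather stay with the plain list of paths from the question, here is a minimal sketch of looping over all folds. The `paths` list below is a made-up stand-in for the `glob` result, and `random_state=0` is only there to make the shuffle reproducible; `MyDataset` is the custom class from the question and is only referenced in a comment.

```python
from sklearn.model_selection import KFold

# Hypothetical stand-in for the glob result in the question (10 directories)
paths = [f"polyp_db_sequences/seq{i}/clip{j}" for i in range(5) for j in range(2)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for fold, (train_index, test_index) in enumerate(kf.split(paths)):
    # KFold yields integer index arrays, so pick the paths out of the list by index
    train_paths = [paths[i] for i in train_index]
    test_paths = [paths[i] for i in test_index]
    folds.append((train_paths, test_paths))

    # Here you would build the datasets for this fold, e.g.
    # train_set = MyDataset(train_paths)
    # train_loader = DataLoader(train_set, batch_size=16)
    print(f"fold {fold}: {len(train_paths)} train / {len(test_paths)} test")
```

With 10 paths and n_splits=5, every fold ends up with 8 training paths and 2 test paths, and each path lands in the test set exactly once across the 5 folds.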

Hope this helps!

Edit: n_splits=5 in KFold gives a train/test ratio of 4:1 in each fold, i.e. an 80/20 split.

Jam · April 22, 2021, 12:33pm

Thanks a lot! This helped a lot, I could make it work now.