I implemented a custom dataset class in PyTorch.
Currently I am reading the data like this:
import glob
import os
import random
from torch.utils.data import DataLoader
# paths
paths = [x for x in glob.glob("polyp_db_sequences/*/*") if os.path.isdir(x)]  # all directories that contain images
# splits
train_paths = random.sample(paths, int(len(paths) * 0.8))  # input for train dataset
test_paths = [x for x in paths if x not in train_paths]  # input for test dataset
train_set = MyDataset(train_paths)
train_loader = DataLoader(train_set, batch_size=16)
However, instead of the 80/20 split above, I'd like to use KFold to split the data into train and test sets. Since I only have the paths, I don't know how to do that.
I think the following should do the trick for you:
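from sklearn.model_selection import KFold
import numpy as np
import os
# myrootdirectory is a placeholder for the folder that holds your data
myData = np.array(os.listdir(myrootdirectory))  # a NumPy array can be indexed with index arrays, a plain list cannot
kf = KFold(n_splits=5, shuffle=True)
# kf.split yields one (train_index, test_index) pair per fold; next() takes the first one
train_index, test_index = next(kf.split(myData))
train_data, test_data = myData[train_index], myData[test_index]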
If you keep your data in a pandas DataFrame instead, you can select the rows with .iloc[train_index] and .iloc[test_index] and pass the resulting train/test DataFrame to your custom Dataset MyDataset. DataFrames also make sampling and shuffling within the training set much easier.
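If you want every fold rather than just the first one, and you only have a plain list of paths as in the question, something along these lines should work. This is only a sketch: kfold_path_splits is a helper name I made up, the example paths are fake stand-ins for what glob would return, and MyDataset is your own class, so it only appears in the comments.

```python
from sklearn.model_selection import KFold


def kfold_path_splits(paths, n_splits=5, seed=0):
    """Yield (train_paths, test_paths) for each fold of a list of paths."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # kf.split returns integer index arrays, so the paths list is indexed manually
    for train_index, test_index in kf.split(paths):
        train_paths = [paths[i] for i in train_index]
        test_paths = [paths[i] for i in test_index]
        yield train_paths, test_paths


# Hypothetical stand-in for the directories found by glob in the question
paths = [f"polyp_db_sequences/seq{i}/clip{j}" for i in range(5) for j in range(4)]

folds = list(kfold_path_splits(paths))
for fold, (train_paths, test_paths) in enumerate(folds):
    print(f"fold {fold}: {len(train_paths)} train, {len(test_paths)} test")
    # Here you would build the datasets as in the question, e.g.
    # train_loader = DataLoader(MyDataset(train_paths), batch_size=16)
    # test_loader = DataLoader(MyDataset(test_paths), batch_size=16)
```

Each path ends up in the test set of exactly one fold, so you would train and evaluate once per fold and average the results.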
Hope this helps!
Edit: n_splits=5 in KFold corresponds to a train/test ratio of 4:1 in each fold, i.e. an 80/20 split.
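If you want to check the fold sizes yourself, a quick sketch like the one below should do it. The sample count of 100 is just an illustrative number I picked, not something from your data.

```python
from sklearn.model_selection import KFold

n_samples = 100  # illustrative sample count, not taken from the question
samples = list(range(n_samples))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Collect (train size, test size) for every fold
fold_sizes = [
    (len(train_index), len(test_index))
    for train_index, test_index in kf.split(samples)
]
print(fold_sizes)
```

With 100 samples and 5 splits, each fold holds out 20 samples for testing and trains on the other 80. More generally, n_splits=k gives a train/test ratio of (k-1):1.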