How to split a dataset folder-wise in PyTorch?

There are 46 benign (B) patients and 44 malignant (M) patients, and each patient has 4 images. I want to use 60% of the patients for training, 20% for validation and 20% for testing. The split should always be done at the patient level, meaning all images of a patient must belong to exactly one of the train, valid or test sets and never be shared among them.
So Train: B: 28 patients + M: 26 patients
Valid: B: 9 patients + M: 9 patients
Test: B: 9 patients + M: 9 patients

I would recommend performing the split before creating the Datasets in PyTorch, e.g. with sklearn.model_selection.GroupShuffleSplit.
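As a rough illustration of this suggestion, here is a minimal sketch that applies GroupShuffleSplit in two stages, once per class, to reach the patient counts from the question. The toy patient IDs, the label rule and the helper name `split_by_patient` are assumptions made for the example; in practice the groups would come from the real patient folders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the real data (assumption): patients 0-45 are benign,
# patients 46-89 are malignant, and every patient has 4 images.
patient_ids = np.arange(90)
groups = np.repeat(patient_ids, 4)        # patient ID of each image
labels = (groups >= 46).astype(int)       # 0 = benign, 1 = malignant


def split_by_patient(indices, groups, n_valid, n_test, seed=0):
    """Split image indices into train/valid/test without sharing patients."""
    # Stage 1: hold out n_test whole patients for the test set.
    gss = GroupShuffleSplit(n_splits=1, test_size=n_test, random_state=seed)
    rest_pos, test_pos = next(gss.split(indices, groups=groups[indices]))
    rest, test = indices[rest_pos], indices[test_pos]
    # Stage 2: hold out n_valid of the remaining patients for validation.
    gss = GroupShuffleSplit(n_splits=1, test_size=n_valid, random_state=seed)
    train_pos, valid_pos = next(gss.split(rest, groups=groups[rest]))
    return rest[train_pos], rest[valid_pos], test


# Split each class separately so the per-class patient counts are fixed.
parts = [split_by_patient(np.flatnonzero(labels == c), groups, n_valid=9, n_test=9)
         for c in (0, 1)]
train_idx, valid_idx, test_idx = (np.concatenate(p) for p in zip(*parts))
```

The resulting index arrays could then be wrapped with torch.utils.data.Subset(dataset, train_idx) and so on. An integer test_size is used here because it is interpreted as an absolute number of groups, which makes the 9-patient hold-outs exact.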

@ptrblck Thanks for the reply.
Can you please tell me how to use it in my code?
In my case, the dataset is being split image-wise, not folder-wise.

import numpy as np
import torch
from torchvision import datasets

master = datasets.ImageFolder(data_dir, transform=train_transforms)

valid_size = 0.2
test_size = 0.2
num_train = len(master)
# These are image indices, so shuffling them splits image-wise, not folder-wise.
indices = list(range(num_train))
np.random.shuffle(indices)
valid_split = int(np.floor(valid_size * num_train))
test_split = int(np.floor((valid_size + test_size) * num_train))
valid_idx, test_idx, train_idx = indices[:valid_split], indices[valid_split:test_split], indices[test_split:]

num_workers = 0
# Wrap the indices in Subset so the loaders yield images rather than bare indices.
train_loader = torch.utils.data.DataLoader(torch.utils.data.Subset(master, train_idx), batch_size=64,
     num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(torch.utils.data.Subset(master, valid_idx), batch_size=64,
     num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(torch.utils.data.Subset(master, test_idx), batch_size=64,
     num_workers=num_workers, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
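For the ImageFolder code above to split by folder, each image first needs a patient ID that a group-aware splitter can use. The sketch below derives one from the file path. It assumes a layout of data_dir/<class>/<patient>/<image>, which the thread does not confirm, and the helper name `patient_groups` and the sample paths are hypothetical. The input has the same (path, class_index) form as the `samples` attribute of ImageFolder.

```python
from pathlib import Path


def patient_groups(samples):
    """Return one group key per image: '<class folder>/<patient folder>'."""
    groups = []
    for path, _class_idx in samples:
        p = Path(path)
        # Parent folder = patient, grandparent folder = class (assumed layout).
        groups.append(f"{p.parent.parent.name}/{p.parent.name}")
    return groups


# Hypothetical entries in the same form as ImageFolder.samples
samples = [
    ("data/B/patient_01/img_1.png", 0),
    ("data/B/patient_01/img_2.png", 0),
    ("data/B/patient_02/img_1.png", 0),
    ("data/M/patient_01/img_1.png", 1),
]
groups = patient_groups(samples)
```

With a real dataset, something like patient_groups(master.samples) could supply the groups argument of a group-based splitter. Including the class folder in the key keeps patients with the same folder name in different classes apart.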

@ptrblck can you please check the code. Is it correct?

import torch
import torchvision
from sklearn.model_selection import LeaveOneGroupOut

dataset = torchvision.datasets.ImageFolder('images/')
kfold = LeaveOneGroupOut()
for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset, groups=dataset.targets)):
    train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
    val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

LeaveOneGroupOut uses the groups argument for the splitting, as described in the docs:

groups: array-like of shape (n_samples,)
Group labels for the samples used while splitting the dataset into train/test set. This ‘groups’ parameter must always be specified to calculate the number of splits, though the other parameters can be omitted.

This groups argument is often used as an additional attribute of the dataset, e.g. in case you are dealing with different patients in a medical dataset and would like to avoid mixing samples from one patient into the training and validation dataset.
In your use case you are using the dataset.targets as the groups argument, so I would assume that you explicitly want to keep some targets only in the training dataset and others in the validation data (which would be uncommon).
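To make the effect concrete, here is a small sketch with toy labels (an assumption, standing in for dataset.targets) showing that passing the targets as groups makes every fold hold out one entire class:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy class labels standing in for dataset.targets: 3 classes, 3 images each.
targets = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

logo = LeaveOneGroupOut()
# X is only used for its length here, so a dummy array is enough.
folds = list(logo.split(np.zeros(len(targets)), groups=targets))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_classes = np.unique(targets[train_idx]).tolist()
    val_classes = np.unique(targets[val_idx]).tolist()
    print(f"fold {fold}: train classes {train_classes}, val classes {val_classes}")
```

There is one fold per class, and the held-out class never appears in the corresponding training indices. Passing patient IDs as groups instead would hold out one patient per fold while leaving every class in the training data.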

Yes, I agree. In my scenario, the label is my folder name. Suppose I have 10 folders, meaning 10 labels: I want 9 labels/folders in training and 1 in test. I think the OP asked for the same thing, since he wants to split based on folders.