How to split an image dataset stored in labeled folders with scikit-learn's train_test_split?

Previously I was using PyTorch to split my dataset and train my classifier, but now I want to use scikit-learn to train an SVM model. For that, I need to split my dataset into train and test sets. scikit-learn does this with xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42). This is what I am currently using to load my data -

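import torch
from torchvision import datasets
from google.colab import drive  # Google Drive is assumed to be mounted at /content/drive

data = "/content/drive/My Drive/AMD_new"
# transform_train and transform_test are defined elsewhere (not shown)
train_data = datasets.ImageFolder(data + "/train", transform=transform_train)
test_data = datasets.ImageFolder(data + "/val", transform=transform_test)
#n_classes = test_data.shape[1]
n_classes = len(test_data.classes)

batch_size = 32

dataloader_train = torch.utils.data.DataLoader(train_data, batch_size, shuffle=True, num_workers=2)
dataloader_test = torch.utils.data.DataLoader(test_data, batch_size, num_workers=2)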

There are 4 labeled folders containing images, uploaded to Google Drive, and I am working from Google Colab. Can anyone please tell me how I can split the data into xtrain, xtest, ytrain and ytest? Should I connect xtest with my val folder and xtrain with my train folder? Then what about ytrain and ytest? I am a little confused. Please help me solve this. Thanks.

@Deb_Prakash_Chatterj I want to implement the same split for my dataset. Can you please tell me how you solved this?

Hey, use Scikit Learn train_test_split, like this -

xtrain, xvalid, ytrain, yvalid = train_test_split(bow[:split_num,:-1], train['label'], test_size=0.3, random_state=42)

Learn from this site - train_test_split
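
For the folder-based setup in the original question, one possible approach is to turn the dataset into plain NumPy arrays first, since scikit-learn expects a 2-D feature matrix X and a 1-D label vector y. The sketch below is only an illustration: dataset_to_arrays is a hypothetical helper, and toy_dataset is a random stand-in for an ImageFolder whose transform resizes every image to the same shape and converts it to a tensor.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def dataset_to_arrays(dataset):
    """Flatten each (image, label) pair into X (n_samples, n_features) and y (n_samples,)."""
    features, labels = [], []
    for image, label in dataset:
        # Works for NumPy arrays and CPU tensors alike; all images must share one shape.
        features.append(np.asarray(image, dtype=np.float32).reshape(-1))
        labels.append(int(label))
    return np.stack(features), np.asarray(labels)


# Hypothetical stand-in for an ImageFolder: 20 random 3x8x8 "images", 4 classes.
rng = np.random.default_rng(0)
toy_dataset = [(rng.random((3, 8, 8)), i % 4) for i in range(20)]

X, y = dataset_to_arrays(toy_dataset)

# stratify=y keeps the class proportions similar in both parts.
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(xtrain.shape, xtest.shape, ytrain.shape, ytest.shape)
```

If the data already sits in separate train and val folders, train_test_split may not be needed at all: calling the helper once on each dataset would give xtrain, ytrain from the train folder and xtest, ytest from the val folder.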


I wanted to use SubsetRandomSampler. Do you have any idea how to use it with batches?
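
As a rough sketch of how SubsetRandomSampler can be combined with batching: the idea is to split the sample indices instead of the data, then hand each index list to its own DataLoader through the sampler argument. The TensorDataset here is a made-up stand-in for a real image dataset, and the 70/30 ratio and batch size of 8 are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical stand-in dataset: 20 random 3x8x8 "images" with 4 class labels.
images = torch.rand(20, 3, 8, 8)
labels = torch.arange(20) % 4
dataset = TensorDataset(images, labels)

# Shuffle the indices once (seeded for reproducibility) and cut them 70/30.
generator = torch.Generator().manual_seed(42)
perm = torch.randperm(len(dataset), generator=generator).tolist()
n_valid = int(round(0.3 * len(dataset)))
valid_idx, train_idx = perm[:n_valid], perm[n_valid:]

# Each sampler draws only from its own index list, in random order.
# shuffle must stay unset when a sampler is given.
train_loader = DataLoader(dataset, batch_size=8, sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(dataset, batch_size=8, sampler=SubsetRandomSampler(valid_idx))

for batch_images, batch_labels in train_loader:
    print(batch_images.shape, batch_labels.shape)
```

Both loaders wrap the same dataset object, so nothing is copied; only the indices differ between them.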

Nope. I don’t have any idea.