I have a single folder with a lot of images, and a DataFrame that contains the labels for all of the images. I wrote a function to load my data into an ImageFolder dataset.
This works well, but I have no idea how to split my ImageFolder dataset into train and validation sets. Is it possible? I have 100,000 images in my folder; about 90% of the data belongs to class 0 and only 10% to class 1, so I need a split that is balanced across the two classes. For example, the train dataset should contain 10,000 samples (5,000 of class 0 and 5,000 of class 1), and the validation dataset 2,000 samples (1,000 of class 0 and 1,000 of class 1).
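import os

import numpy as np
import torchvision


def load_data(data_path, df):
    train_data = torchvision.datasets.ImageFolder(
        root=data_path,
        transform=torchvision.transforms.ToTensor()
    )
    train_data.classes = ['FAKE', 'REAL']
    train_data.class_to_idx = {k: v for k, v in zip(train_data.classes, [0, 1])}
    img_folder = os.path.join(data_path, 'faces_224')
    # Sort the DataFrame and the file list by name so that labels line up with files
    # (os.listdir returns entries in arbitrary order).
    df = df.sort_values(by='image_name')
    labels_array = np.array(df.label.astype('category').cat.codes)
    files = sorted(os.path.join(img_folder, f) for f in os.listdir(img_folder)
                   if os.path.isfile(os.path.join(img_folder, f)))
    # Keep labels as ints: np.column_stack would turn them into strings.
    samples = [(f, int(label)) for f, label in zip(files, labels_array)]
    train_data.samples = samples
    train_data.targets = [label for _, label in samples]
    return train_data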
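
A minimal sketch of the kind of balanced split described above, not a definitive solution. It only computes index lists, so it runs without PyTorch; the helper name `balanced_split_indices` and its parameters are hypothetical, and the toy label list stands in for the real labels.

```python
import random
from collections import defaultdict


def balanced_split_indices(targets, n_train_per_class, n_val_per_class, seed=0):
    """Return (train_idx, val_idx) with the same number of samples per class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[int(label)].append(idx)

    needed = n_train_per_class + n_val_per_class
    train_idx, val_idx = [], []
    for label, indices in sorted(by_class.items()):
        if len(indices) < needed:
            raise ValueError(
                f"class {label} has {len(indices)} samples, need {needed}"
            )
        # Shuffle within the class, then take disjoint train and validation slices.
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train_per_class])
        val_idx.extend(indices[n_train_per_class:needed])

    # Mix the classes so the index lists are not ordered by label.
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx


# Toy labels with the same 90/10 imbalance as in the question.
toy_targets = [0] * 90 + [1] * 10
train_idx, val_idx = balanced_split_indices(
    toy_targets, n_train_per_class=5, n_val_per_class=2
)
```

With the real dataset, the labels could be read from the second element of each entry in `train_data.samples`, called with `n_train_per_class=5000` and `n_val_per_class=1000`, and the two index lists wrapped as `torch.utils.data.Subset(train_data, train_idx)` and `torch.utils.data.Subset(train_data, val_idx)`. Class 1 has roughly 10,000 images, so the 6,000 needed per class fit.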