Split data by label in ImageFolder Dataset

kenkpix · June 28, 2020, 10:46pm

I have one folder with a lot of images and a DataFrame that contains the labels for all of them. I wrote a function to load my data into an ImageFolder dataset.
This works well, but I have no idea how to split my ImageFolder dataset into train and validation sets. Is it possible? I have 100,000 images in my folder; about 90% of them belong to class 0 and only 10% to class 1, so I need a split with equal numbers of samples per class. For example, the train dataset should contain 10,000 samples (5,000 for class 0 and 5,000 for class 1) and the validation dataset 2,000 samples (1,000 for class 0 and 1,000 for class 1).

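import os
import torchvision

def load_data(data_path, df):
    train_data = torchvision.datasets.ImageFolder(
        root=data_path,
        transform=torchvision.transforms.ToTensor()
    )
    train_data.classes = ['FAKE', 'REAL']
    train_data.class_to_idx = {k: v for k, v in zip(train_data.classes, [0, 1])}

    img_folder = os.path.join(data_path, 'faces_224')

    # Sort the DataFrame and the file list the same way so that labels line up
    # with files (os.listdir returns entries in arbitrary order).
    df = df.sort_values(by='image_name')
    labels = df.label.astype('category').cat.codes.tolist()
    files = sorted(os.path.join(img_folder, f) for f in os.listdir(img_folder)
                   if os.path.isfile(os.path.join(img_folder, f)))

    # ImageFolder expects samples as a list of (path, int class index) tuples;
    # np.column_stack would turn the labels into strings.
    train_data.samples = list(zip(files, labels))
    train_data.targets = labels

    return train_data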

Flint · June 29, 2020, 5:15am

Yes! Here is a very helpful example of how to do it.

kenkpix · July 3, 2020, 3:39pm

Sorry for the late reply. That code only splits the data into training and validation sets. As I wrote in my question, I need to split my dataset evenly by class: the number of samples for each class should be equal within both datasets (train and validation).

For example, the train dataset should contain 10,000 samples (5,000 for class 0 and 5,000 for class 1) and the validation dataset 2,000 samples (1,000 for class 0 and 1,000 for class 1).

Akshay_Gulabrao · May 19, 2022, 5:17am

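import numpy

# Group the samples by class, then shuffle within each class.
# Assumes `train` exposes `data` (a NumPy array) and `targets`,
# and that there are 10 classes; use range(2) for a two-class dataset.
targets = numpy.array(train.targets)
sorted_by_value = [None] * 10
for i in range(10):
    sorted_by_value[i] = train.data[targets == i]
    numpy.random.shuffle(sorted_by_value[i])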
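
For the ImageFolder case in the original question, the same group-by-class idea can be applied to sample indices instead of the image data. Below is a minimal sketch, not a definitive implementation: the helper name `balanced_split_indices` and its parameters are made up for illustration, and it assumes you can get one integer label per sample (for example from the dataset's `targets` or `samples`).

```python
import random
from collections import defaultdict


def balanced_split_indices(targets, n_train_per_class, n_val_per_class, seed=0):
    """Return (train_idx, val_idx) with equal per-class counts and no overlap.

    Hypothetical helper for illustration; `targets` is any sequence of labels.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[int(label)].append(idx)

    train_idx, val_idx = [], []
    needed = n_train_per_class + n_val_per_class
    for label, indices in sorted(by_class.items()):
        if len(indices) < needed:
            raise ValueError(
                f"class {label} has {len(indices)} samples, need {needed}"
            )
        # Shuffle within the class, then cut disjoint train/val slices.
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train_per_class])
        val_idx.extend(indices[n_train_per_class:needed])

    # Shuffle again so the classes are not stored in contiguous blocks.
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx


# Toy labels with the same 90/10 imbalance as in the question.
toy_targets = [0] * 900 + [1] * 100
train_idx, val_idx = balanced_split_indices(toy_targets, 50, 10)
```

With a real dataset, the labels could be collected as `[label for _, label in dataset.samples]`, and the two index lists wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `torch.utils.data.Subset(dataset, val_idx)`. For the numbers in the question that would mean `n_train_per_class=5000` and `n_val_per_class=1000`.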