Split dataset (advanced way)


I would like to randomly split my dataset into training and test sets while keeping my 2 classes balanced, and save this split for future training runs.

This is because I want to train several different pretrained models under the same conditions (with the same test images in every run), so the random split should be created only once, at the start.

Thank you very much!

You could use e.g. sklearn.model_selection.train_test_split with the stratify argument and save the split indices once to recreate the dataset splits in future runs (seeding might also work, but saving the indices directly might be the better option).
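
As a rough sketch of that idea (the toy labels and the file path are placeholders, not part of your setup), you could split once with stratify, store the indices with NumPy, and load them again in later runs. Note that stratify keeps the class proportions of the full dataset in both parts; it does not force a 50/50 balance.

```python
import os
import tempfile

import numpy as np
from sklearn.model_selection import train_test_split

# Toy binary labels standing in for the real targets (assumption: 60 vs. 40).
labels = np.array([0] * 60 + [1] * 40)
indices = np.arange(len(labels))

# Stratified split: both parts keep the 60/40 class ratio of the full data.
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, shuffle=True, stratify=labels
)

# Save the indices once; this path is a hypothetical placeholder.
split_path = os.path.join(tempfile.gettempdir(), "split_indices.npz")
np.savez(split_path, train=train_idx, test=test_idx)

# In later runs, load the stored indices instead of splitting again.
saved = np.load(split_path)
train_idx_loaded, test_idx_loaded = saved["train"], saved["test"]
```

Storing the indices makes the split independent of the random seed and of any change in how the library shuffles between versions.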

Thank you! I have been looking into the function, but I don’t know what to pass for each of its parameters.
If I have a dataset class like:

import os

import pandas as pd
import torch
from skimage import io  # assumed: io.imread comes from scikit-image
from torch.utils.data import Dataset


class mi_dataset(Dataset):
    def __init__(self, csv_file, root_dir, transform):
        # CSV columns: image file name, class label
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        image = io.imread(img_path)
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))
        if self.transform:
            image = self.transform(image)
        return image, y_label

transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize([480, 480]),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset= mi_dataset(csv_file= './Dataset/labels.csv', root_dir= './Dataset/images', transform= transform)
train_set, test_set = train_test_split(*arrays=, test_size=0.1, train_size=0.9, random_state=None, shuffle=True, stratify=)

Do you know what I should pass for the *arrays and stratify arguments? Also, I don’t know how to save the split indices (by adding two more outputs alongside train_set and test_set?).

You can use the indices in range(len(dataset)) as the input array to split and provide the targets of your dataset to the stratify argument.
The returned indices can then be used to create separate torch.utils.data.Subsets using your dataset and the corresponding split indices.
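
A minimal sketch of this, using a toy TensorDataset as a stand-in for your dataset (the sizes and the targets array are made up for illustration):

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for the real dataset: 20 samples, 2 classes (assumption).
targets = np.array([0] * 10 + [1] * 10)
dataset = TensorDataset(torch.randn(20, 3), torch.from_numpy(targets))

# Split the sample indices, stratified by the targets.
train_idx, test_idx = train_test_split(
    np.arange(len(dataset)), test_size=0.2, shuffle=True, stratify=targets
)

# Subset wraps the original dataset and only exposes the given indices.
train_set = Subset(dataset, train_idx.tolist())
test_set = Subset(dataset, test_idx.tolist())
```

Both subsets can then be passed to a DataLoader like any other dataset.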

Like this?

range_train, range_test = train_test_split(range(len(dataset)), test_size=0.1, train_size=0.9, random_state=None, shuffle=True, stratify=range(len(dataset)))

train_set = torch.utils.data.Subset(dataset, range_train)
test_set = torch.utils.data.Subset(dataset, range_test)

If I run this, I get an error:

The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

No, stratify expects the targets, not a list of indices.
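
To illustrate why the error appears, here is a small sketch with made-up targets: if the indices are passed to stratify, every sample is treated as its own class with a single member, whereas passing the labels works.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(20)
targets = np.array([0, 1] * 10)  # toy labels, 10 samples per class (assumption)

# Stratifying by the indices makes every sample a separate "class" with one
# member, so sklearn raises a ValueError.
try:
    train_test_split(indices, test_size=0.1, stratify=indices)
    raised = False
except ValueError:
    raised = True

# Stratifying by the targets works and keeps the class ratio in both parts.
train_idx, test_idx = train_test_split(indices, test_size=0.1, stratify=targets)
```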

Then how should I pass it?
Keep in mind that my CSV file has 2 columns: the first contains the image file names and the second contains the class (a 1 or a 0), and the file is loaded inside the dataset object.
Thank you very much.

Based on your description, you could extract the targets directly from the CSV file, or alternatively iterate over your dataset once and store the targets in a new NumPy array.
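
A sketch of the first option, assuming the CSV has a header row, the file names in the first column, and the 0/1 labels in the second (the in-memory CSV and its file names are made up so the snippet runs on its own):

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# In-memory stand-in for labels.csv (assumption: header row, file names in
# column 0, 0/1 labels in column 1).
csv_text = "image,label\n" + "\n".join(
    f"img_{i}.png,{i % 2}" for i in range(20)
)
annotations = pd.read_csv(io.StringIO(csv_text))

# Read the targets straight from the second CSV column.
targets = annotations.iloc[:, 1].to_numpy()

# Split the row indices, stratified by those targets.
train_idx, test_idx = train_test_split(
    list(range(len(annotations))), test_size=0.1, stratify=targets
)
```

With the dataset class posted above, the same column should be reachable as dataset.annotations.iloc[:, 1], which avoids loading every image just to collect the labels.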

Done. It works. Thank you!