Split dataset (advanced way)

Alphonsito25 · September 27, 2022, 6:48pm

Hi!

I would like to randomly split my dataset into training and test sets, keep my 2 classes balanced in the split, and save this split for future training runs.

This is because I want to train several different pretrained models under the same conditions (the test images always the same in each run), so the random split has to be created only once, up front.

Thank you very much!

ptrblck · September 28, 2022, 5:27am

You could use e.g. sklearn.model_selection.train_test_split with the stratify argument and save the split indices once to recreate the dataset splits in future runs (seeding might also work, but saving the indices directly might be the better option).

Alphonsito25 · September 28, 2022, 6:05pm

Thank you! I have been researching the function, but I don’t know what to pass for each of its parameters.
If I have a dataset class like:

import os

import pandas as pd
import torch
from skimage import io  # assumption: io.imread comes from scikit-image
from torch.utils.data import Dataset

class mi_dataset(Dataset):
    def __init__(self, csv_file, root_dir, transform):
        # CSV columns: image file name, class label (0 or 1)
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        image = io.imread(img_path)
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))

        if self.transform:
            image = self.transform(image)
        return image, y_label

transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize([480, 480]),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset= mi_dataset(csv_file= './Dataset/labels.csv', root_dir= './Dataset/images', transform= transform)
train_set, test_set = train_test_split(*arrays=, test_size=0.1, train_size=0.9, random_state=None, shuffle=True, stratify=)

Do you know what I should pass for the *arrays and stratify arguments? Also, I don’t know how to save the split indices (by adding two more outputs next to train_set and test_set?).

ptrblck · September 28, 2022, 11:47pm

You can use the indices in range(len(dataset)) as the input array to split and provide the targets of your dataset to the stratify argument.
The returned indices can then be used to create separate torch.utils.data.Subsets using your dataset and the corresponding split indices.
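
As a rough sketch of that suggestion (not code from this thread): the toy TensorDataset, its 70/30 label array, and the fixed random_state below are assumptions made only so the example runs on its own. Note that stratify keeps the class proportions of the full dataset in both splits; it does not equalize the classes.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for the real dataset: 100 samples, 2 classes (70 / 30).
targets = np.array([0] * 70 + [1] * 30)
dataset = TensorDataset(torch.randn(100, 3), torch.from_numpy(targets))

# Split the sample positions, not the data itself, and stratify on the labels.
indices = np.arange(len(dataset))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.1,
    shuffle=True,
    stratify=targets,
    random_state=0,  # fixed here only so the sketch is reproducible
)

# Wrap the original dataset; each Subset only sees its own indices.
train_set = Subset(dataset, train_idx.tolist())
test_set = Subset(dataset, test_idx.tolist())

print(len(train_set), len(test_set))
print("test class counts:", np.bincount(targets[test_idx]))
```

With a real dataset, targets would be the label array of that dataset, in the same order as its samples.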

Alphonsito25 · September 29, 2022, 5:05pm

Like this?

range_train, range_test = train_test_split(range(len(dataset)), test_size=0.1, train_size=0.9, random_state=None, shuffle=True, stratify=range(len(dataset)))

train_set = torch.utils.data.Subset(dataset, range_train)
test_set = torch.utils.data.Subset(dataset, range_test)

If I run this, I get an error:

The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

ptrblck · September 29, 2022, 5:12pm

No, stratify expects the targets, not a list of indices.
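
To illustrate why the call above fails, here is a small sketch with made-up labels: passing the indices to stratify makes every sample its own class with a single member, which is what the error message complains about. The targets array is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 20
indices = np.arange(n)
targets = np.array([0, 1] * 10)  # made-up binary labels, 10 per class

# Stratifying on the indices: 20 "classes" with one member each -> ValueError.
try:
    train_test_split(indices, test_size=0.1, stratify=indices)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# Stratifying on the labels: the split succeeds and keeps the class ratio.
train_idx, test_idx = train_test_split(indices, test_size=0.1, stratify=targets)
print("test labels:", targets[test_idx])
```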

Alphonsito25 · September 29, 2022, 5:30pm

Then how should I pass it?
My CSV file has 2 columns: the first one holds the image file names and the second one holds the class of each image (a 1 or a 0). The CSV is loaded inside the dataset object.
Thank you very much.

ptrblck · September 29, 2022, 5:45pm

Based on your description, you could extract the targets directly from the CSV file. Alternatively, you could iterate over your dataset once and store the targets in a new NumPy array.
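
One way the CSV route could look, as a sketch under assumptions: the temporary CSV, the helper load_or_create_split, and the file name split_indices.npz are all hypothetical and only stand in for the real ./Dataset/labels.csv. The split is created on the first call and reloaded from disk afterwards, so every later run sees the same test samples.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the real labels file: column 0 = file name, column 1 = class.
workdir = tempfile.mkdtemp()
csv_file = os.path.join(workdir, "labels.csv")
pd.DataFrame({
    "filename": [f"img_{i}.png" for i in range(40)],
    "label": [0] * 20 + [1] * 20,
}).to_csv(csv_file, index=False)

# Read the targets from the second column, in dataset order.
targets = pd.read_csv(csv_file).iloc[:, 1].to_numpy()


def load_or_create_split(targets, split_file, test_size=0.1):
    """Reuse the saved indices if present, otherwise create and save them."""
    if os.path.exists(split_file):
        with np.load(split_file) as saved:
            return saved["train"], saved["test"]
    train_idx, test_idx = train_test_split(
        np.arange(len(targets)),
        test_size=test_size,
        shuffle=True,
        stratify=targets,
    )
    np.savez(split_file, train=train_idx, test=test_idx)
    return train_idx, test_idx


split_file = os.path.join(workdir, "split_indices.npz")  # hypothetical name
first = load_or_create_split(targets, split_file)   # creates and saves
second = load_or_create_split(targets, split_file)  # reloads the same split
```

The returned index arrays can then be passed to torch.utils.data.Subset together with the dataset, as described earlier in the thread.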

Alphonsito25 · September 29, 2022, 6:36pm

Done it. It works. Thank you!

ado_sar · June 30, 2024, 9:05pm

A random split can also be performed with torch.utils.data.random_split, without needing scikit-learn:

indices = ['img_0', 'img_1', ...]  # A sequence with all indices.
train, val, test = random_split(indices, (0.8, 0.1, 0.1))
train_idx = train.indices  # Similar for val and test.

ptrblck · July 1, 2024, 2:09pm

No, the built-in torch.utils.data.random_split cannot be used to create balanced class distributions as requested in this topic. The stratify argument of sklearn.model_selection.train_test_split mentioned in my earlier post can be used instead.

ado_sar · July 5, 2024, 7:37pm

Oops! Apologies, I didn’t realize the OP was asking about stratified splits.