I would like to randomly split my dataset into training and test sets, keep both splits balanced across my 2 classes, and save the split for future training runs.
This is because I want to train several different pretrained models under the same conditions (the same test images in every run), so the random split should be created only once, up front.
You could use e.g. sklearn.model_selection.train_test_split with the stratify argument and save the split indices once to recreate the dataset splits in future runs (seeding might also work, but saving the indices directly might be the better option).
Do you know where I should pass the *arrays and stratify arguments? Also, I don't know where to save the split indices (by adding two more outputs alongside train_set and test_set?)
You can use the indices in range(len(dataset)) as the input array to split and provide the targets of your dataset to the stratify argument.
The returned indices can then be used to create separate torch.utils.data.Subsets using your dataset and the corresponding split indices.
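A minimal sketch of the steps above, in case it helps. The TensorDataset with fake data, the 70/30 class ratio, test_size=0.2, and the file name split_indices.npz are all assumptions for illustration; replace them with your own dataset, targets, and path.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Stand-in dataset (assumption): 100 fake "images" with binary labels, 70 vs. 30.
targets = np.array([0] * 70 + [1] * 30)
dataset = TensorDataset(torch.randn(100, 3, 8, 8), torch.from_numpy(targets))

# Split the indices, not the data. stratify=targets keeps the class ratio
# the same in the training and test indices.
indices = np.arange(len(dataset))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=targets, random_state=0
)

# Save the indices once (hypothetical file name).
np.savez("split_indices.npz", train_idx=train_idx, test_idx=test_idx)

# In every later run, load the saved indices instead of splitting again.
split = np.load("split_indices.npz")
train_set = Subset(dataset, split["train_idx"].tolist())
test_set = Subset(dataset, split["test_idx"].tolist())
```

Note that stratify preserves the original class ratio in both splits; it does not force a 50/50 balance if the full dataset is imbalanced. The indices are stored in a separate file, so train_set and test_set do not need any extra outputs.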
Then how should I write it?
My CSV file has 2 columns: the first one holds the image names and the second one holds the class of each image (1 or 0), and it is contained in the dataset object.
Thank you very much.
Based on your description, you could extract the targets directly from the CSV file. Alternatively, you could iterate over your dataset once and store the targets in a new numpy array.
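Both options could look roughly like this. This is only a sketch: it assumes the CSV has no header row, that the label is in the second column, and that dataset[i] returns an (image, label) pair. The in-memory CSV text and the list standing in for the dataset are placeholders.

```python
import csv
import io

import numpy as np

# Stand-in for the CSV file (assumption: no header row; columns are name, label).
csv_text = "img_000.png,1\nimg_001.png,0\nimg_002.png,0\nimg_003.png,1\n"

# Option 1: read the label column straight from the CSV.
# For a real file, pass open(path, newline="") instead of io.StringIO(...).
reader = csv.reader(io.StringIO(csv_text))
targets_from_csv = np.array([int(row[1]) for row in reader])

# Option 2: iterate the dataset once and collect the label of each sample.
# A list of (sample, label) tuples stands in for the real dataset here.
dataset = [(f"img_{i:03d}.png", int(t)) for i, t in enumerate(targets_from_csv)]
targets_from_dataset = np.array([label for _, label in dataset])
```

Either array can then be passed to the stratify argument. Reading the CSV is usually the cheaper option, since iterating the dataset loads every image once just to get its label.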