I am trying to load a dataset composed of a large number of images located on my computer (This PC/Desktop/mydata), where mydata is a folder containing the images. I also need to split the images into 80% training and 20% testing, and then split the training set into 80% training and 20% validation.
What would be the code needed for me to do this?
Thanks for the help!
I don’t know how the images are stored or how you would like to use them, but if you are working on a multi-class classification and the images are stored in “class folders” (i.e. each folder contains the images for a specific class), you could use torchvision.datasets.ImageFolder.
However, if this does not fit your use case, you might want to implement a custom Dataset as described e.g. here.
Once you’ve written this dataset, you could then use torch.utils.data.random_split to split it into a training and validation set.
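For illustration, a minimal custom Dataset combined with random_split might look like the sketch below. The tensor placeholders and the dummy label stand in for real image loading (e.g. PIL.Image.open plus a torchvision transform), which would depend on how your files are stored:

```python
import torch
from torch.utils.data import Dataset, random_split

class MyImageDataset(Dataset):
    # Minimal sketch: replace the placeholders with real image loading,
    # e.g. PIL.Image.open plus a torchvision transform.
    def __init__(self, num_samples=100):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)  # placeholder for a loaded image
        label = idx % 2                   # placeholder label
        return image, label

dataset = MyImageDataset()
train_len = int(len(dataset) * 0.8)
train_dataset, val_dataset = random_split(
    dataset, [train_len, len(dataset) - train_len])
print(len(train_dataset), len(val_dataset))  # 80 20
```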
Thank you for the reply, I will try to do this. I think I will use ImageFolder because there are certain categories, and in each category there are a set of images which belong to it.
That indeed sounds like a good fit for ImageFolder.
Would I be able to use ImageFolder in the following fashion?
If so, I am still unsure how I would use this:
My question is: how would I specify that this splits the data into training and testing, and then use random_split once more to split training into training and validation? I do not see where I am specifying that one subset is for training and the other for testing (or validation).
Could you provide some code and/or an explanation as to how I would do this, as well as any changes I have to make to my usage of ImageFolder? Thanks!
The root path looks wrong; you should specify it as a path that Python can find (if this path works on a Windows system, ignore my advice, as I’m not really familiar with Windows setups).
To split the data, you could use:
```python
nb_samples = len(dataset)
print(nb_samples)  # 100

train_split = int(nb_samples * 0.8)
train_dataset, val_test_dataset = torch.utils.data.random_split(
    dataset, [train_split, nb_samples - train_split])
print(len(train_dataset), len(val_test_dataset))  # 80 20

val_split = int(len(val_test_dataset) * 0.5)
val_dataset, test_dataset = torch.utils.data.random_split(
    val_test_dataset, [val_split, len(val_test_dataset) - val_split])
print(len(val_dataset), len(test_dataset))  # 10 10
```
If you need a more advanced method to create the splits (e.g. stratified splitting), you could also check e.g. sklearn.model_selection.train_test_split, create the indices for each split, and use torch.utils.data.Subset to create the training, validation, and test datasets.
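As a sketch of that approach, using a dummy TensorDataset in place of the real ImageFolder dataset:

```python
import torch
from torch.utils.data import Subset, TensorDataset
from sklearn.model_selection import train_test_split

# Dummy dataset standing in for the real ImageFolder dataset
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

indices = list(range(len(dataset)))
# First split off 20% for testing, then 20% of the remainder for validation
train_val_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.2, random_state=0)

train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)
test_dataset = Subset(dataset, test_idx)
print(len(train_dataset), len(val_dataset), len(test_dataset))  # 64 16 20
```

train_test_split also accepts a stratify argument, which would keep the class distribution similar across the splits.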
I understand the splitting portion now, thank you!
It is a Windows OS but the root path was incorrect, like you said. It is actually Desktop/mydata.
Also, to get my desired distribution it would be 0.64 training, 0.16 validation, and 0.2 testing (since I am splitting 80/20 into training/test, and then splitting training 80/20 into training/validation), so I could just change the 0.8 to 0.64 and the 0.5 to 4/9 and it would work, correct?
Thanks for the help, I really appreciate it.
Yes, my example just used an 80-10-10 split, but you can of course change these values for your use case.
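To make the arithmetic concrete, plugging your fractions into the earlier snippet (again with a dummy dataset of 100 samples standing in for yours) yields the 64/16/20 distribution you described:

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(100, 2))  # stand-in for the real dataset

nb_samples = len(dataset)
train_split = int(nb_samples * 0.64)
train_dataset, val_test_dataset = random_split(
    dataset, [train_split, nb_samples - train_split])

# 4/9 of the remaining 36 samples go to validation, the rest to testing
val_split = int(len(val_test_dataset) * 4 / 9)
val_dataset, test_dataset = random_split(
    val_test_dataset, [val_split, len(val_test_dataset) - val_split])
print(len(train_dataset), len(val_dataset), len(test_dataset))  # 64 16 20
```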