I am trying to load a dataset composed of a large number of images located on my computer (This PC/Desktop/mydata), where mydata is a folder containing the images. I also need to split the images into 80% training and 20% testing, and then split the training set into 80% training and 20% validation.
What would be the code needed for me to do this?
Thanks for the help!
I don’t know how the images are stored or how you would like to use them, but if you are working on a multi-class classification and the images are stored in “class folders” (i.e. each folder contains the images for a specific class), you could use torchvision.datasets.ImageFolder.
However, if this does not fit your use case, you might want to implement a custom Dataset as described e.g. here.
Once you’ve written this dataset, you could then use torch.utils.data.random_split to split it into a training and validation set.
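For illustration, a minimal custom Dataset combined with random_split might look like the sketch below. The tensor placeholders and the dummy label stand in for real image loading (e.g. PIL.Image.open plus a torchvision transform), which would depend on how your files are stored:

```python
import torch
from torch.utils.data import Dataset, random_split

class MyImageDataset(Dataset):
    # Minimal sketch: replace the placeholders with real image loading,
    # e.g. PIL.Image.open plus a torchvision transform.
    def __init__(self, num_samples=100):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)  # placeholder for a loaded image
        label = idx % 2                   # placeholder label
        return image, label

dataset = MyImageDataset()
train_len = int(len(dataset) * 0.8)
train_dataset, val_dataset = random_split(
    dataset, [train_len, len(dataset) - train_len])
print(len(train_dataset), len(val_dataset))  # 80 20
```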
Thank you for the reply, I will try to do this. I think I will use ImageFolder because there are certain categories, and in each category there are a set of images which belong to it.
That indeed sounds like a good fit for ImageFolder.
Would I be able to use ImageFolder in the following fashion?
If so, I am still unsure how I would use this:
My question is: how would I specify that this splits the data into training and testing, and then use random_split once more to split training into training and validation? I do not see where I am specifying that one subset is for training and the other for testing (or validation).
Could you provide some code and/or an explanation as to how I would do this, as well as any changes I have to make to my usage of ImageFolder? Thanks!
The root path looks wrong; you should specify it as a path that Python can find (if this path works on a Windows system, ignore my advice, as I’m not really familiar with Windows setups).
To split the data, you could use:
```python
nb_samples = len(dataset)
print(nb_samples)  # 100

train_split = int(nb_samples * 0.8)
train_dataset, val_test_dataset = torch.utils.data.random_split(
    dataset, [train_split, nb_samples - train_split])
print(len(train_dataset), len(val_test_dataset))  # 80 20

val_split = int(len(val_test_dataset) * 0.5)
val_dataset, test_dataset = torch.utils.data.random_split(
    val_test_dataset, [val_split, len(val_test_dataset) - val_split])
print(len(val_dataset), len(test_dataset))  # 10 10
```
If you need a more advanced method to create the splits (e.g. stratified splitting), you could also check e.g. sklearn.model_selection.train_test_split, create the indices for each split, and use torch.utils.data.Subset to create the training, validation, and test datasets.
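As a sketch of that approach, using a dummy TensorDataset in place of the real ImageFolder dataset:

```python
import torch
from torch.utils.data import Subset, TensorDataset
from sklearn.model_selection import train_test_split

# Dummy dataset standing in for the real ImageFolder dataset
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

indices = list(range(len(dataset)))
# First split off 20% for testing, then 20% of the remainder for validation
train_val_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.2, random_state=0)

train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)
test_dataset = Subset(dataset, test_idx)
print(len(train_dataset), len(val_dataset), len(test_dataset))  # 64 16 20
```

train_test_split also accepts a stratify argument, which would keep the class distribution similar across the splits.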
I understand the splitting portion now, thank you!
It is a Windows OS but the root path was incorrect, like you said. It is actually Desktop/mydata.
Also, to get my desired distribution it would be 0.64 training, 0.16 validation, and 0.2 testing (since I am splitting 80/20 into training/test, and then splitting training 80/20 into training/validation), so I could just change the 0.8 to 0.64 and the 0.5 to 4/9 and it would work, correct?
Thanks for the help, I really appreciate it.
Yes, my example just used an 80-10-10 split, but you can of course change these values for your use case.
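To make the arithmetic concrete, plugging your fractions into the earlier snippet (again with a dummy dataset of 100 samples standing in for yours) yields the 64/16/20 distribution you described:

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(100, 2))  # stand-in for the real dataset

nb_samples = len(dataset)
train_split = int(nb_samples * 0.64)
train_dataset, val_test_dataset = random_split(
    dataset, [train_split, nb_samples - train_split])

# 4/9 of the remaining 36 samples go to validation, the rest to testing
val_split = int(len(val_test_dataset) * 4 / 9)
val_dataset, test_dataset = random_split(
    val_test_dataset, [val_split, len(val_test_dataset) - val_split])
print(len(train_dataset), len(val_dataset), len(test_dataset))  # 64 16 20
```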