How to divide dataset into training, validation and testing

I have a dataset of images containing two classes. I want to divide it into a train set, a validation set and a test set, and then apply different transformations to each. Any help with this? My code is:

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(),
    transforms.RandomHorizontalFlip(),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

test_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

validation_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

data_dir = 'Image_folder'
data_set = datasets.ImageFolder(data_dir, transform=train_transforms)

Then I apply a random split:
num_train = len(data_set)
indices = list(range(num_train))
np.random.shuffle(indices)
valid_split = int(np.floor(valid_size * num_train))
test_split = int(np.floor(test_size * num_train))
valid_idx = indices[:valid_split]
test_idx = indices[valid_split:valid_split + test_split]
train_idx = indices[valid_split + test_split:]
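For reference, the same three-way split can be written as a small self-contained function (`valid_size` and `test_size` are assumed to be fractions of the dataset, and the function name is just for illustration):

```python
import numpy as np

def split_indices(num_samples, valid_size=0.1, test_size=0.1, seed=0):
    """Shuffle indices and carve out validation, test and train portions."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(num_samples).tolist()
    n_valid = int(np.floor(valid_size * num_samples))
    n_test = int(np.floor(test_size * num_samples))
    valid_idx = indices[:n_valid]
    test_idx = indices[n_valid:n_valid + n_test]
    train_idx = indices[n_valid + n_test:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(100, valid_size=0.2, test_size=0.1)
print(len(train_idx), len(valid_idx), len(test_idx))  # 70 20 10
```

Note the test slice starts at `n_valid` and ends at `n_valid + n_test`, so the three index lists are disjoint and together cover the whole dataset.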

How do I apply different transformations to the validation and test datasets?


You could create different transformations for the corresponding datasets and use a SubsetRandomSampler to split the data according to the indices. @kevinzakka created a good example for CIFAR10 here.


That is different from what I'm looking for. In CIFAR10 there are two datasets, training and test, and then the training set is divided into train and valid, which is simple. What I have is a folder of images with two classes, and I want to create train, valid and test sets from it, each with its own transformation.

Wouldn’t it work in the same way?
I.e. if you have your indices, you could create a SubsetRandomSampler or alternatively a Subset passing the indices for train, valid and test, and create the three datasets with separate transformations.

As long as you are loading your samples lazily, you shouldn’t run into trouble of cloning the data etc.

Let me know, if that would work for you.

I don't know if it's possible to apply transformations when I create the SubsetRandomSampler, because in my code I already apply them when creating the dataset with data_set = datasets.ImageFolder(data_dir, transform=train_transforms), so after getting the indices I just pass them to the SubsetRandomSampler.

The transformations will still be applied, if you pass a sampler to your DataLoader. In case you would like to apply different transformations for training, validation and test, you could create different Datasets with the corresponding transformation. After this step, you could pass the sampler to your DataLoaders or use Subset instead.
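A minimal sketch of the sampler approach, using a toy TensorDataset in place of the ImageFolder dataset (the data and index lists are made up for illustration; with an ImageFolder the dataset's transform would still run when each sample is fetched):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler

# Toy data standing in for the ImageFolder dataset
data = TensorDataset(torch.randn(10, 3), torch.randint(0, 2, (10,)))

train_idx = [0, 1, 2, 3, 4, 5, 6]
valid_idx = [7, 8, 9]

# Each sampler restricts its DataLoader to its own (shuffled) indices
train_loader = DataLoader(data, batch_size=2, sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(data, batch_size=2, sampler=SubsetRandomSampler(valid_idx))

# Each loader only ever yields samples from its own index list
n_train = sum(x.size(0) for x, _ in train_loader)
print(n_train)  # 7
```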

Can you help with how to create the Datasets with the corresponding transformations?

You can just pass them to the initialization of your Datasets:

train_dataset = ImageFolder(
    data_dir,
    transform=train_transform,
)

val_dataset = ImageFolder(
    data_dir,
    transform=validation_transform
)
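To make the whole pipeline concrete, here is a self-contained sketch using a tiny synthetic dataset in place of ImageFolder (the class, shapes and transforms are invented for illustration): each dataset instance carries its own transform over the same underlying samples, and Subset restricts each one to its split indices.

```python
import torch
from torch.utils.data import Dataset, Subset, DataLoader

class ToyImageDataset(Dataset):
    """Stand-in for ImageFolder: same samples, per-instance transform."""
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x = self.images[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]

images = torch.randn(10, 3, 8, 8)
labels = torch.randint(0, 2, (10,))

# One dataset instance per split, each with its own transform
train_ds = ToyImageDataset(images, labels, transform=lambda x: x * 2.0)
val_ds = ToyImageDataset(images, labels, transform=None)

# Disjoint index lists decide which samples each Subset sees
train_set = Subset(train_ds, list(range(0, 8)))
val_set = Subset(val_ds, list(range(8, 10)))

train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = DataLoader(val_set, batch_size=2)

xb, yb = next(iter(val_loader))
print(xb.shape)  # torch.Size([2, 3, 8, 8])
```

With real data you would replace ToyImageDataset by two (or three) ImageFolder instances over the same data_dir, each constructed with its own transform, and reuse the index lists from the earlier split.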

@ptrblck can you please provide an example taking these resulting training/validation loaders as input in a grid search fit function for K-fold cross-validation with Skorch? Thanks

I managed to do what you advised and it works as I wanted. Thanks a lot!

What is the procedure for loading data in PyTorch that is in text format? I think the images were compressed to text format so that they have a small size.

You could use torchtext to work with text data.
What do you mean by “image is compressed to text format”?

I don't know how to explain it well, but when I open the .txt documents I find numbers like this (truncated; each file contains hundreds of whitespace-separated floats):

0.155083 0.182153 0.989079 0.884268 1.235660 0.319778 0.319628 0.192321 0.052866 0.583728 0.026591 0.670798 1.338510 0.088570 0.290366 0.058517 0.099633 0.305436 0.163680 0.353837 ...

It seems the (normalized) images were saved as text files. I don’t know why it was done, but it won’t save you any memory and will most likely use more space.
You could just load the images and reshape them to the appropriate shape so that you can work with them again.
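A hedged sketch of that loading step, assuming each .txt file holds the flattened pixel values of one image as whitespace-separated floats (the 8x8 size here is invented; you need to know the real image shape beforehand):

```python
import numpy as np
import tempfile, os

# Create a stand-in .txt file like the ones described above
values = np.random.rand(8 * 8)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(" ".join(f"{v:.6f}" for v in values))
    path = f.name

flat = np.loadtxt(path)      # 1-D array of floats
image = flat.reshape(8, 8)   # reshape to the known image shape
print(image.shape)           # (8, 8)
os.remove(path)
```

From there you could wrap the arrays in tensors (e.g. torch.from_numpy) and build a Dataset around them as usual.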