Suppose I have a dataset with the following classes:
Class A: 3000 items
Class B: 1000 items
Class C: 2000 items
I want to split this dataset into two parts so that 25% of the data ends up in the test set. How can I do this so that the same percentage of each class is present in the test set? The items should be selected randomly. For example, the test set should look like this:
Class A: 750 items
Class B: 250 items
Class C: 500 items
You could keep lists of pairs of filenames and labels and prepare batches asynchronously in a background worker thread. That should be efficient even when training with millions of images. You might wanna take a look at the torch.utils.data.dataset.Dataset and torch.utils.data.sampler.Sampler classes that you can use in conjunction with the torch.utils.data.DataLoader.
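To make the suggestion above concrete, here is a minimal sketch of a Dataset that only stores (filename, label) pairs and reads a file when it is indexed. The names FileLabelDataset and fake_loader are hypothetical, and fake_loader is a placeholder for real image decoding so the snippet runs without any files on disk.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FileLabelDataset(Dataset):
    """Stores (filename, label) pairs; a file is only read when it is indexed."""

    def __init__(self, samples, loader):
        self.samples = list(samples)  # list of (filename, label) pairs
        self.loader = loader          # callable: filename -> tensor (assumption)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        filename, label = self.samples[idx]
        return self.loader(filename), label


def fake_loader(filename):
    # Placeholder for real image decoding; returns a dummy 3x8x8 "image".
    return torch.zeros(3, 8, 8)


samples = [(f"img_{i}.png", i % 3) for i in range(10)]  # made-up file names
dataset = FileLabelDataset(samples, fake_loader)

# num_workers=0 loads in the main process; a value > 0 prepares batches
# in background worker processes instead.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)
images, labels = next(iter(loader))
```

With a real decoding function in place of fake_loader, raising num_workers lets the DataLoader prepare batches in the background while the model trains.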
If you load the dataset completely before passing it to the Dataset and DataLoader classes, you could use scikit-learn’s train_test_split with the stratify argument.
Are you able to get all the targets without loading the actual data?
If so, you could use train_test_split passing the indices and targets, and use the index splits for your Subsets.
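A minimal sketch of that idea, assuming the targets are available as an array: the indices are split with stratification and each index split is wrapped in a Subset. The toy TensorDataset and the random features are placeholders, and the labels 0, 1, 2 stand in for classes A, B, C from the question.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for the real data: 3000 / 1000 / 2000 items per class.
targets = np.array([0] * 3000 + [1] * 1000 + [2] * 2000)
features = torch.randn(len(targets), 4)  # placeholder features
dataset = TensorDataset(features, torch.from_numpy(targets))

# Split the indices, not the data; stratify keeps the class ratios equal
# in both parts, and shuffle=True makes the selection random.
indices = np.arange(len(targets))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.25,
    stratify=targets,
    shuffle=True,
    random_state=0,  # fixed seed only for reproducibility of the sketch
)

train_set = Subset(dataset, train_idx.tolist())
test_set = Subset(dataset, test_idx.tolist())

# Per-class counts in the test split.
test_counts = np.bincount(targets[test_idx])
print(test_counts)
```

Both Subsets can then be passed to a DataLoader like any other Dataset.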
I created the split data for validation and training, and now I want to pass it to the data loader. I am confused about how to pass my data to the data loader; I don’t know how to build DatasetTrain and DatasetValid. I applied this:
Sorry, to find the best number of epochs, I use the list of validation losses from all epochs. I convert the list to a NumPy array, find the minimum loss and its index, and take that index as the optimum epoch.
With the CPU it works fine, but when I run the code on the GPU it gives me an error. The code is:
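```
import numpy as np
import torch

# val_losses: list of validation losses collected over all epochs
val_lossesArray = np.asarray(val_losses)
vvv2 = torch.from_numpy(val_lossesArray)
result = np.where(val_lossesArray == np.amin(val_lossesArray))  # line reported in the error
vv1 = result[-1]
EpochFinal = vv1[0]
print("best epoch", EpochFinal)
```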
and the error is:
result = np.where(val_lossesArray == np.amin(val_lossesArray))
TypeError: eq() received an invalid combination of arguments - got (numpy.ndarray), but expected one of:
 * (Tensor other)
      didn't match because some of the arguments have invalid types: (!numpy.ndarray!)
 * (Number other)
      didn't match because some of the arguments have invalid types: (!numpy.ndarray!)
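One possible way around this, assuming val_losses is a list of zero-dimensional loss tensors that live on the GPU during GPU runs (an assumption, since the post does not show how the list is built): convert every entry to a plain Python float before handing the list to NumPy, then take the index of the minimum with np.argmin. The loss values below are made up for illustration.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for per-epoch validation losses stored as tensors on the device.
val_losses = [torch.tensor(v, device=device) for v in (0.9, 0.5, 0.7, 0.4, 0.6)]

# .item() copies a scalar tensor to the host as a Python float,
# so NumPy only ever sees plain numbers.
val_losses_array = np.asarray(
    [loss.item() if torch.is_tensor(loss) else float(loss) for loss in val_losses]
)

# Zero-based index of the first minimum validation loss.
best_epoch = int(np.argmin(val_losses_array))
print("best epoch", best_epoch)
```

Appending loss.item() to the list inside the validation loop would have the same effect and avoids keeping tensors around between epochs.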
Hi @ptrblck, I was wondering whether there is another way to do this now: a torch approach, instead of reading a dataframe, doing a train/test split, and then creating 3 dataloaders and 3 datasets for train/val/test?