Doubt about cross-validation with training, validation and test set

hi all,
I have a question to which I apparently cannot find an answer anywhere.
Let's say I have a relatively small dataset, and I want to be able to test on the entire dataset to reduce the bias of my evaluation.
I thought I could use something like k-fold cross-validation, but no matter where I look, I only find the case where, for each fold, the data is split into just a training and a test set. What I am interested in is having, for each fold, a training, validation AND test set: I train the model on the training data, use the validation data to decide when to stop training, and then evaluate on the test data. I would then report the average of the scores obtained on the different test sets.
However, I am unsure how to split the data and whether this procedure is actually correct.
Would this, for example, be correct? (I fix [p] to be the validation set and rotate the test set among the remaining folds.)
Dataset = [o, d, p, x]
Fold1: Train = [o, d], Validation = [p], Test=[x]
Fold2: Train = [o, x], Validation = [p], Test=[d]
Fold3: Train = [d, x], Validation = [p], Test=[o]
Thanks for the help!

@rasbt published this great post a while ago, in which he explains cross-validation in detail (you should also check out other resources on his blog).
To perform the actual splitting you could use e.g. scikit-learn’s methods.
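For example, a minimal sketch using KFold on random dummy data (just to illustrate the splitting; plug in your own dataset, model, and metric):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(100, 10)           # dummy features
y = np.random.randint(0, 2, size=100)  # dummy targets

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # fit your model on (X_train, y_train) and evaluate it on (X_test, y_test)
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```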

Hi ptrblck,
thanks for the useful link!
But I am still not sure whether what I want to do is correct. I want to use k-fold CV for model evaluation; I am not interested in any hyper-parameter selection. However, because I am using neural networks, I still need a validation set to decide when to stop training.
In the slides of the link you posted, this case is covered:
Dataset = [o, d, p, x]
Fold1: Train = [o, d, p], Test=[x]
Fold2: Train = [d, p, x], Test=[o]
Fold3: Train = [p, x, o], Test=[d]
Fold4: Train = [x, o, d], Test=[p]
from which I could estimate the performance of my model with fixed hyper-parameters as the average of the performance on the test sets across the different folds.
But no validation set is used here, and I am wondering whether I should keep the validation set fixed across the different folds (e.g. [p], as in my previous post), let it vary from fold to fold, or use some inner loop.
What do you think?

IMO, cross-validation is flawed by definition. So, if possible, the test set should be fully separated from the training loop. If not, then I'd choose scenario 3: run an outer loop that picks the test subset, then run an inner loop on the remaining data, training the model via cross-validation.
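Roughly like this (just a sketch with dummy data; LogisticRegression and the regularization search are stand-ins for your network, where the inner loop would instead decide things like the stopping epoch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Dummy data just to make the sketch runnable; replace with your dataset.
rng = np.random.RandomState(0)
X = rng.randn(120, 10)
y = rng.randint(0, 2, 120)

outer = KFold(n_splits=4, shuffle=True, random_state=0)
inner = KFold(n_splits=3, shuffle=True, random_state=0)

outer_scores = []
for trainval_idx, test_idx in outer.split(X):
    # Inner cross-validation on the non-test data, e.g. to pick a setting
    # (for a network this is where you would also decide when to stop training).
    best_C, best_val = None, -np.inf
    for C in [0.1, 1.0, 10.0]:
        val_scores = []
        for tr_pos, val_pos in inner.split(trainval_idx):
            tr, val = trainval_idx[tr_pos], trainval_idx[val_pos]
            clf = LogisticRegression(C=C, max_iter=1000).fit(X[tr], y[tr])
            val_scores.append(accuracy_score(y[val], clf.predict(X[val])))
        if np.mean(val_scores) > best_val:
            best_val, best_C = np.mean(val_scores), C
    # Refit on all non-test data with the chosen setting,
    # then score the untouched outer test fold exactly once.
    clf = LogisticRegression(C=best_C, max_iter=1000).fit(X[trainval_idx], y[trainval_idx])
    outer_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("estimated performance:", np.mean(outer_scores))
```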

Yet, I'd first look into the possibility of generating additional samples, either fully synthetic or augmented.

hi Sergey,
unfortunately, generating additional samples is not an option for my application.
So you think this approach
Dataset = [o, d, p, x]
Fold1: Train = [o, d], Validation = [p], Test=[x]
Fold2: Train = [o, x], Validation = [p], Test=[d]
Fold3: Train = [d, x], Validation = [p], Test=[o]
is good? I only want to use [p] for checking when to stop training the network on [o,d], for example.

Or is it better to do it like this?
Dataset = [o, d, p, x]
Fold1: Train = [o, d], Validation = [p], Test=[x]
Fold2: Train = [d, p], Validation = [x], Test=[o]
Fold3: Train = [p, x], Validation = [o], Test=[d]
Fold4: Train = [x, o], Validation = [d], Test=[p]
Thank you!

Both of the schemes you suggested give an unbiased estimate of the test-set error, which is the most important aspect.
But your second approach probably has lower variance, since you are also switching up the validation set.

But keep in mind that your procedure differs from plain cross-validation, and I would not call it that, or you will confuse people.
For your second approach at least, what you are doing is nested resampling, where the outer resampling is cross-validation and the inner resampling is holdout (see here for more).
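In code, your second scheme could look roughly like this (a sketch with dummy data and scikit-learn's KFold; the rotation of the validation fold mirrors your [o, d, p, x] listing, and the actual network training / early stopping is only indicated by comments):

```python
import numpy as np
from sklearn.model_selection import KFold

# Dummy data; replace with your own dataset.
rng = np.random.RandomState(0)
X = rng.randn(80, 10)
y = rng.randint(0, 2, 80)

k = 4
folds = [test_idx for _, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X)]

test_scores = []
for i in range(k):
    test_idx = folds[i]
    val_idx = folds[(i - 1) % k]  # inner holdout, used only for early stopping
    train_idx = np.concatenate([folds[j] for j in range(k) if j not in (i, (i - 1) % k)])

    # train your network on X[train_idx], y[train_idx],
    # monitor the loss on X[val_idx], y[val_idx] to decide when to stop,
    # then evaluate once on X[test_idx], y[test_idx]:
    # test_scores.append(score_on_test_fold)

# report np.mean(test_scores) as the performance estimate
```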