Doubts regarding stratified k-fold cross validation

I am trying to perform stratified k-fold cross-validation on a multi-class image classification problem (4 classes), but I have some doubts about it.

As I understand it, we train on every fold for a certain number of epochs, calculate the performance on each fold, and then average these values to report an average metric (accuracy or whichever metric is chosen).
My doubts are:
1) Do we reset the model weights, learning rate, and optimiser state for every fold?
2) If yes (which I highly doubt), how is this different from a normal hold-out method? It just trains the model on different splits of the data, and since the weights are not saved anywhere, every fold basically starts from scratch.
3) If no, how do we reuse the same weights, learning rate, and optimiser state in each fold?
My understanding is that the learning rate and optimiser state should be reset for every fold, while the weights should carry over from the previous folds.
Below is my code. It uses the default initial weights of the pre-trained model, but the learning rate is carried over from the previous fold, and I cannot understand why. I assume the optimiser state carries over in the same way.

from torch.utils.data import DataLoader

batch_size = 512
df_train, df_test, splits = cross_validation_train_test(csv_file='file.csv', stratify_colname='labels') # noqa
# model_transfer, optimizer, criterion_transfer and scheduler are not created inside this loop
for fold in range(5):
    print("Fold: ", fold)
    # Build the train/val/test partition for the current fold
    partition, labels = kfold(df_train, df_test, splits, fold, stratify_columns='labels') # noqa
    training_set = Dataset(partition['train_set'], labels, root_dir=root_dir, train_transform=True) # noqa
    validation_set = Dataset(partition['val_set'], labels, root_dir=root_dir, valid_transform=True) # noqa
    test_set = Dataset(partition['test_set'], labels, root_dir=root_dir, test_transform=None)
    train_loader = DataLoader(training_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size) # noqa
    val_loader = DataLoader(validation_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size) # noqa
    test_loader = DataLoader(test_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size) # noqa
    data_transfer = {'train': train_loader,
                     'valid': val_loader,
                     'test': test_loader}
    train_model(model=model_transfer, loader=data_transfer, optimizer=optimizer, criterion=criterion_transfer, scheduler=scheduler, n_epochs=50) # noqa
  1. Yes, you are resetting the hyperparameters and are training a new model in each iteration.

  2. k-fold CV gives you a better “unbiased” estimate of the generalization performance of your model.

@rasbt explains this technique here and also compares it to hold-out methods.

Often you would use k-fold CV if your dataset is small and your overall training might thus finish quickly.
I haven’t seen this technique used for a DL model on e.g. ImageNet and don’t know if you would benefit much, since this dataset is already large.
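As a rough illustration of what "stratified" and "average over folds" mean here, the sketch below builds stratified folds with the standard library only and averages per-fold scores. The function names (`stratified_kfold_indices`, `mean_and_std`) and the toy labels are made up for this example and are not taken from the code in the question.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, val_idx) pairs with per-class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train_idx, val_idx))
    return splits


def mean_and_std(scores):
    """Average the per-fold scores; the spread hints at how stable the estimate is."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var ** 0.5


# Toy data: 4 classes with 10 samples each (made up for illustration).
labels = [c for c in range(4) for _ in range(10)]
splits = stratified_kfold_indices(labels, n_splits=5)
```

Each sample ends up in the validation set of exactly one fold, and the score reported for the model is the mean of the per-fold validation scores.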


Shouldn't the model's weights be re-initialized after each fold? Also, doesn't this require creating a new optimizer instance for each fold, since the optimizer is built from model.parameters()?

Yes, I think the parameters should be reset in each fold (which I named iteration in my previous post).
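If you are resetting the parameters in place, you might not need to create a new optimizer, since it still references the same parameter tensors.
However, its internal state (e.g. momentum buffers) and any learning rate already changed by a scheduler would carry over, so you can avoid potential issues by just creating a new one.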
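One way to do this, sketched under some assumptions: snapshot the starting weights once, then restore them and build a new optimizer and scheduler at the start of every fold. The small `nn.Sequential` model, the `new_fold_setup` helper, and the Adam/StepLR choices are placeholders for illustration, not the setup from the question. If the optimizer and scheduler are instead created once outside the fold loop, a decayed learning rate would be expected to carry over between folds.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained model from the question.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Snapshot the starting weights once, before any fold has been trained.
initial_state = copy.deepcopy(model.state_dict())

LEARNING_RATE = 1e-3  # fixed hyperparameter, identical in every fold


def new_fold_setup(model, initial_state, lr):
    """Restore the starting weights and build a fresh optimizer and scheduler."""
    model.load_state_dict(initial_state)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    return optimizer, scheduler


for fold in range(3):
    optimizer, scheduler = new_fold_setup(model, initial_state, LEARNING_RATE)
    # One dummy training step on random data, only to change weights and state.
    inputs, targets = torch.randn(32, 8), torch.randint(0, 4, (32,))
    loss = nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

Restoring a saved `state_dict` rather than re-initializing randomly keeps the pre-trained starting point the same in every fold.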


Thank you for the answer. The first example has no explicit call such as model_transfer.reset_parameters() to reset the model's parameters. How is it done?

.reset_parameters() is defined on many built-in layers (e.g. nn.Linear, nn.Conv2d) rather than on the nn.Module base class, and you could call it on all registered submodules that implement it.
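A minimal sketch of that idea, assuming a small made-up model: `Module.apply` visits every submodule, and the helper (named `reset_all_parameters` here purely for illustration) calls `reset_parameters()` only where a layer provides it.

```python
import torch
import torch.nn as nn


def reset_all_parameters(model):
    """Call reset_parameters() on every submodule that defines it."""
    def _reset(module):
        reset = getattr(module, "reset_parameters", None)
        if callable(reset):
            reset()

    # apply() runs _reset on each child module and finally on the model itself.
    model.apply(_reset)


# Hypothetical model; ReLU and the Sequential container have nothing to reset.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 4),
)
reset_all_parameters(model)
```

Note that this re-initializes the layers randomly, so for a pre-trained model it would discard the pre-trained weights; reloading the original checkpoint or a saved `state_dict` is probably the better fit in that case.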

Sorry, but why do we tune hyperparameters for each fold? If I’m not misunderstanding, the typical goal of cross-validation is to select the best model hyperparameters (according to the val set average performance) and examine model generalizability. Tuning hyperparams in each fold essentially undermines this and makes the model selection (i.e. measurement of model generalizability) invalid.

In fact, I think the best practice is to keep a holdout set, do cross-validation (splitting only into train and val) on the remaining data for hyperparameter tuning, and then combine all of the remaining data to re-train a final model with the selected hyperparameters. Happy to discuss more.
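This workflow can be sketched with a toy example. Everything below is made up for illustration: the synthetic 1-D data, the tiny k-nearest-neighbour "model", and the candidate grid stand in for a real dataset, network, and hyperparameters, and the folds are not stratified to keep the code short.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, evaluation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return hits / len(evaluation)


def cross_val_score(data, k, n_splits=5):
    """Mean validation accuracy; the model is rebuilt from scratch in each fold."""
    scores = []
    for fold in range(n_splits):
        val = data[fold::n_splits]
        train = [p for i, p in enumerate(data) if i % n_splits != fold]
        scores.append(accuracy(train, val, k))
    return sum(scores) / n_splits


rng = random.Random(0)
# Synthetic two-class data: class means at 0 and 2, unit variance.
data = [(rng.gauss(2.0 * label, 1.0), label) for label in (0, 1) for _ in range(60)]
rng.shuffle(data)

# 1) Set the holdout set aside; it is not touched during tuning.
holdout, remaining = data[:24], data[24:]
# 2) Cross-validate each candidate value on the remaining data only.
candidates = [1, 5, 15]  # hypothetical hyperparameter grid
cv_scores = {k: cross_val_score(remaining, k) for k in candidates}
# 3) Select the value with the best mean validation score.
best_k = max(cv_scores, key=cv_scores.get)
# 4) Refit on all remaining data and evaluate once on the holdout set.
final_score = accuracy(remaining, holdout, best_k)
```

The same hyperparameter value is used in all folds of a given cross-validation run; only the fitted model changes from fold to fold.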

To summarize: no, I think that for the sake of model selection, one should NOT change hyperparams (incl. learning rate, batch size, weight decay, etc.) across folds. However, the weights should be trained separately for each fold, and I agree with ptrblck's later post on resetting the parameters.

Yes, I think you are right and my posts weren’t clear enough.
My understanding is one should reset the “training” in each fold, i.e. the model’s parameters, optimizer states, etc. and make sure to train a new model in each fold. The other “hyperparameters” such as learning rate, weight decay etc. should not be changed in each fold.