Doubts regarding stratified k-fold cross validation

I am trying to perform stratified k-fold cross-validation on a multi-class image classification problem (4 classes), but I have some doubts about it.

As I understand it, we train the model on each fold for a certain number of epochs, compute the performance on each fold's validation set, and average those results into a single metric (accuracy, or whichever metric is chosen).
My doubts are:
1) Do we reset the model weights, learning rate, and optimiser state for every fold?
2) If yes (which I highly doubt), how is this different from a normal hold-out method? It would just be training the model on differently split data, and since the weights are not saved anywhere, every fold would start from scratch.
3) If no, how do we reuse the same weights, learning rate, and optimiser state in each fold?
My understanding is that the learning rate and optimiser state should be reset for every fold, while the weights should be carried over from the previous folds.
Below is my code. It uses the default first-time weight initialisation of the pre-trained model, but the learning rate is carried over from the previous fold, which I cannot understand; I assume the optimiser state is carried over in the same way.

from torch.utils.data import DataLoader

batch_size = 512
df_train, df_test, splits = cross_validation_train_test(csv_file='file.csv', stratify_colname='labels') # noqa
for fold in range(5):
    print("Fold: ", fold)
    partition, labels = kfold(df_train, df_test, splits, fold, stratify_columns='labels') # noqa
    training_set = Dataset(partition['train_set'], labels, root_dir=root_dir, train_transform=True) # noqa
    validation_set = Dataset(partition['val_set'], labels, root_dir=root_dir, valid_transform=True) # noqa
    test_set = Dataset(partition['test_set'], labels, root_dir=root_dir, test_transform=None)
    train_loader = DataLoader(training_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size) # noqa
    val_loader = DataLoader(validation_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size)
    test_loader = DataLoader(test_set, shuffle=True, pin_memory=True, num_workers=0, batch_size=batch_size) # noqa
    data_transfer = {'train': train_loader,
                     'valid': val_loader,
                     'test': test_loader}
    train_model(model=model_transfer, loader=data_transfer, optimizer=optimizer, criterion=criterion_transfer, scheduler=scheduler, n_epochs=50) # noqa
  1. Yes, you reset the model parameters and optimizer state and train a new model in each iteration (fold).

  2. k-fold CV gives you a better “unbiased” estimate of the generalization performance of your model.

@rasbt explains this technique here and also compares it to other hold-out methods.

Often you would use k-fold CV if your dataset is small and your overall training might thus finish quickly.
I haven’t seen this technique used for a DL model on e.g. ImageNet and don’t know if you would benefit much, since this dataset is already large.
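To make the "new model per fold" point concrete, here is a minimal sketch using scikit-learn's `StratifiedKFold` on a balanced toy 4-class label array (the labels, sample count, and variable names are illustrative, not from the code above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels for a 4-class problem (mirrors the 4-class setup in the question):
# 100 samples, 25 per class.
labels = np.array([0, 1, 2, 3] * 25)
X = np.arange(len(labels))  # stand-in for image indices

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, labels)):
    # A fresh model and a fresh optimizer would be created here for each fold,
    # trained on train_idx, and evaluated on val_idx.
    counts = np.bincount(labels[val_idx], minlength=4)
    fold_class_counts.append(counts.tolist())

# Stratification keeps the class ratio identical in every validation fold.
print(fold_class_counts[0])  # → [5, 5, 5, 5]
```

Each validation fold ends up with the same class distribution as the full dataset, which is the whole point of stratifying.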


Shouldn't the weights of the model be re-initialized after each fold? Also, doesn't it require creating a new instance of the optimizer for each fold, since the optimizer uses model.parameters()?

Yes, I think the parameters should be reset in each fold (which I called an iteration in my previous post).
If you reset the parameters in place, you might not need to create a new optimizer.
However, you can also avoid potential issues by simply creating a new one.
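A minimal sketch of that per-fold reset, using a small stand-in model (the snapshot/reload idiom works the same way for a pretrained model whose initial weights you want to restore each fold):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
# Snapshot the initial (e.g. pretrained) weights once, before any fold runs.
initial_state = copy.deepcopy(model.state_dict())
criterion = nn.CrossEntropyLoss()

for fold in range(3):
    # Reset the weights in place by reloading the snapshot ...
    model.load_state_dict(initial_state)
    # ... and create a brand-new optimizer, so momentum buffers from the
    # previous fold cannot leak into this one.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Dummy training step standing in for the real per-fold training loop.
    x, y = torch.randn(32, 8), torch.randint(0, 4, (32,))
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the optimizer is recreated after the reload, it also picks up the freshly restored parameters, so there is no stale reference or carried-over state between folds.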


Thank you for the answer. There wasn't an explicit function call like model_transfer.reset_parameters() to reset the model's parameters in the first example, so how is it done?

.reset_parameters() is a method defined on most parametric nn.Module subclasses (e.g. nn.Linear, nn.Conv2d), and you could call it on all registered submodules.
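For example, a generic sketch using model.apply(), guarding with hasattr since containers like nn.Sequential don't define reset_parameters():

```python
import torch
import torch.nn as nn

def weight_reset(m: nn.Module) -> None:
    # Only parametric layers (nn.Linear, nn.Conv2d, nn.BatchNorm2d, ...)
    # define reset_parameters(); containers like nn.Sequential do not.
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
before = model[0].weight.clone()

# .apply() visits the module and every submodule recursively,
# so every layer gets freshly re-initialized weights.
model.apply(weight_reset)
```

Note that this re-draws weights from the layers' default init schemes, so for a pretrained model you would instead reload a saved state_dict of the pretrained weights.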