Memory Bleeds during Cross Validation

Hello all! After reading the different posts on cross validation and trying to fix my problem on my own, I decided to come and ask the community. In brief, my problem is as follows: when I run cross validation on my 3D UNet, which I am currently testing with 10 epochs of training, I see the following:

K-fold =1
Loss for Epoch 1 of validation is: 0.885310267147265
Loss for Epoch 10 of validation is: 0.8392143343624315
K-fold =2
Loss for Epoch 1 of validation is: 0.8566861936920568
Loss for Epoch 10 of validation is: 0.8010107718015972
K-fold =3
Loss for Epoch 1 of validation is: 0.7986778742388675
… and so on.

In other words, it appears that, rather than resetting my model for every fold, I am continuing to train the network. The test dataset is small, and I would expect to see similar validation values for every K-fold.

For context, the code that I wrote has a train function and a nested train_runner function. The train function differentiates between a train call with or without cross validation, while the nested train_runner function loads the train_data, validation_data, the train_loader and validation_loader (classes based on DataLoader) and the U-net model (defined as a separate class), and then builds a Solver object, which trains the model and takes the U-net model plus several hyperparameters as arguments. Before passing the U-net model to the solver, I call a .reset_parameters() function, similar to the ones in previous threads, to reset all my weights. After this I train, save the trained model, call del train_data, validation_data, train_loader, validation_loader, UNet, solver, and then torch.cuda.empty_cache().
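To illustrate, here is a minimal sketch of the kind of reset I mean (a toy stand-in, not my actual UNet3D class): it just re-runs the built-in reset_parameters() of every submodule.

import torch.nn as nn

class UNet3DSketch(nn.Module):
    """Hypothetical stand-in for the real UNet3D, only to show the reset pattern."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.decoder = nn.Conv3d(8, 1, kernel_size=3, padding=1)

    def reset_parameters(self):
        # Re-initialise every child module that defines its own reset_parameters()
        for module in self.modules():
            if module is not self and hasattr(module, 'reset_parameters'):
                module.reset_parameters()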

For the cross validation, I call the train_runner function inside a loop, once for every fold.

Now… I don’t really understand where the bleeding is coming from, since I reset the network parameters for every call of the nested function. Inside my solver init I also define my Adam optimizer - do I need to somehow reset it every iteration? Shouldn’t that already happen through my call of the solver each time? Apologies for the long post - it has been a long few days trying to solve this issue.
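(For clarity, this toy loop - not my real code - is the pattern I assumed would give me a fresh model and a fresh optimizer on every fold:)

import torch
import torch.nn as nn

for fold in range(3):
    model = nn.Linear(4, 2)                                     # fresh toy "model" per fold
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # fresh optimizer bound to it
    # Adam's running estimates start out empty for every fold
    print(fold, optimizer.state_dict()['state'])                 # 0 {}, 1 {}, 2 {}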

Here is a mock of my code. Some of the functions used are defined elsewhere and, in the interest of space, I am leaving them out.

def train(data_parameters, training_parameters, network_parameters, misc_parameters):
    """Training Function

    This function trains a given model using the provided training data.
    
    Args:
        data_parameters (dict): Dictionary containing relevant information for the datafiles.
        training_parameters (dict): Dictionary containing relevant hyperparameters for training the network.
        network_parameters (dict): Dictionary containing the relevant network parameters.
        misc_parameters (dict): Dictionary of additional hyperparameters.
    """

    def _train_runner(data_parameters, training_parameters, network_parameters, misc_parameters):
        """Wrapper for the training operation
        """

        train_data, validation_data = load_data(data_parameters)

        train_loader = data.DataLoader(
            dataset=train_data,
            batch_size=training_parameters['training_batch_size'],
            shuffle=True,
            num_workers=4,
            pin_memory=True
        )

        validation_loader = data.DataLoader(
            dataset=validation_data,
            batch_size=training_parameters['validation_batch_size'],
            shuffle=False,
            num_workers=4,
            pin_memory=True
        )

        if training_parameters['use_pre_trained']:
            model = torch.load(
                training_parameters['pre_trained_path'])
        else:
            model = UNet3D(network_parameters)

        model.reset_parameters()

        solver = Solver(model=model,
                        device=misc_parameters['device'],
                        number_of_classes=network_parameters['number_of_classes'],
                        experiment_name=training_parameters['experiment_name'],
                        optimizer_arguments={'lr': training_parameters['learning_rate'],
                                             'betas': training_parameters['optimizer_beta'],
                                             'eps': training_parameters['optimizer_epsilon'],
                                             'weight_decay': training_parameters['optimizer_weigth_decay']
                                             },
                        model_name=misc_parameters['model_name'],
                        number_epochs=training_parameters['number_of_epochs'],
                        loss_log_period=training_parameters['loss_log_period'],
                        learning_rate_scheduler_step_size=training_parameters[
                            'learning_rate_scheduler_step_size'],
                        learning_rate_scheduler_gamma=training_parameters['learning_rate_scheduler_gamma'],
                        use_last_checkpoint=training_parameters['use_last_checkpoint'],
                        experiment_directory=misc_parameters['experiments_directory'],
                        logs_directory=misc_parameters['logs_directory'],
                        checkpoint_directory=misc_parameters['checkpoint_directory']
                        )

        validation_loss = solver.train(train_loader, validation_loader)

        model_output_path = os.path.join(
            misc_parameters['save_model_directory'], training_parameters['final_model_output_file'])

        create_folder(misc_parameters['save_model_directory'])

        torch.save(model, model_output_path)

        print("Final Model Saved in: {}".format(model_output_path))

        del train_data, validation_data, train_loader, validation_loader, model, solver
        torch.cuda.empty_cache()

        return validation_loss

    if data_parameters['k_fold'] is None:

        _ = _train_runner(data_parameters, training_parameters,
                          network_parameters, misc_parameters)

    else:
        print("Training initiated using K-fold Cross Validation!")
        k_fold_losses = []
        # Keep the original output file name so each fold only appends its own index
        base_output_file = training_parameters['final_model_output_file']

        for k in range(data_parameters['k_fold']):

            print("K-fold Number: {}".format(k+1))  

            data_parameters['train_list'] = os.path.join(
                data_parameters['data_folder_name'], 'train' + str(k+1)+'.txt')
            data_parameters['validation_list'] = os.path.join(
                data_parameters['data_folder_name'], 'validation' + str(k+1)+'.txt')
            training_parameters['final_model_output_file'] = base_output_file.replace(
                ".pth.tar", str(k+1)+".pth.tar")

            validation_loss = _train_runner(
                data_parameters, training_parameters, network_parameters, misc_parameters)

            k_fold_losses.append(validation_loss)

        for k in range(data_parameters['k_fold']):
            print("K-fold Number: {} Loss: {}".format(k+1, k_fold_losses[k]))
        print("K-fold Cross Validation Avearge Loss: {}".format(np.mean(k_fold_losses)))

And my Solver's __init__:

class Solver():
    """Solver class for the BrainMapper U-net.

    This class contains the pytorch implementation of the U-net solver required for the BrainMapper project.
    """

    def __init__(self,
                 model,
                 device,
                 number_of_classes,
                 experiment_name,
                 optimizer=torch.optim.Adam,
                 optimizer_arguments={},
                 loss_function=MSELoss(),
                 model_name='Model',
                 labels=None,
                 number_epochs=10,
                 loss_log_period=5,
                 learning_rate_scheduler_step_size=5,
                 learning_rate_scheduler_gamma=0.5,
                 use_last_checkpoint=True,
                 experiment_directory='experiments',
                 logs_directory='logs',
                 checkpoint_directory = 'checkpoints'
                 ):

        self.model = model
        self.device = device
        self.optimizer = optimizer(model.parameters(), **optimizer_arguments)

etc... 

Could you initialize some parameters to a predefined value (e.g. -100) in your weight init method for the sake of debugging?
Then, for each fold, you could check this value and make sure your model was indeed re-initialized.
If you don’t see the expected value, then somehow your “old” model is still being used.
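Something along these lines (a rough standalone sketch with a toy model, not your actual code):

import torch
import torch.nn as nn

# Fill every parameter with a sentinel value in the weight init method,
# then assert the sentinel is still there at the start of each fold.
def debug_reset_parameters(model, sentinel=-100.0):
    with torch.no_grad():
        for param in model.parameters():
            param.fill_(sentinel)

model = nn.Linear(4, 2)            # stands in for the 3D U-net
debug_reset_parameters(model)
assert torch.all(model.weight == -100.0), "old model is still being used!"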

Hey @ptrblck - thank you for the suggestion. I did try with -100 and, as expected, the model was indeed re-initialized for each fold. I tried this both inside and outside my solver method, with the same results. In that case, could my optimizer be causing the similar results? Or could it just be down to the data (although that would be quite weird…)?

Thanks for the debugging.
As long as you recreate the optimizer, it should work.
You could check optimizer.state_dict()['state'] for the running estimates. If it is not empty, something is wrong.
Otherwise, I agree that the result looks unexpected, but the data might be the next thing to check.
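For reference, here is a quick standalone illustration of the state check (toy model, not your U-net): the state is empty right after construction and only fills up once optimizer.step() has run.

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
print(optimizer.state_dict()['state'])   # {} -> freshly created, no running estimates

loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()
print(optimizer.state_dict()['state'])   # now holds exp_avg / exp_avg_sq per parameter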

Thank you for your reply. I did check the optimizer.state_dict()['state'] and it comes up as an empty dictionary for every call of the solver. As it turns out, the unexpected results are due to the data. I will continue experimenting with different UNet configurations, as the current model was just a very simple one, and will update the post if needed. Thank you for your suggestions and help @ptrblck!
