Hello all! After reading all the different posts on cross validation, and trying to fix my problem on my own, I decided to come and ask the community. In brief, my problem is as follows: when I try running cross validation on my 3D UNet, which I am currently testing using 10 epochs of training, I notice the following:
K-fold = 1
Loss for Epoch 1 of validation is: 0.885310267147265
Loss for Epoch 10 of validation is: 0.8392143343624315
K-fold = 2
Loss for Epoch 1 of validation is: 0.8566861936920568
Loss for Epoch 10 of validation is: 0.8010107718015972
K-fold = 3
Loss for Epoch 1 of validation is: 0.7986778742388675
… and so on.
In other words, it appears that, rather than resetting my model for every fold, I am continuing to train the same network. The test dataset is small, and I would expect to see similar validation values for every K-fold.
For context, my code has a train function with a nested _train_runner function. The train function decides between a training run with or without cross validation, while the nested _train_runner function loads the train_data and validation_data, builds the train_loader and validation_loader (classes based on DataLoader), instantiates the U-net model (defined as a separate class), and constructs a Solver class, which trains the model and takes the U-net model and several hyperparameters as arguments. Before handing the U-net model to the solver, I call a .reset_parameters() function, similar to the ones in previous threads, to reset all my weights. After this I train, save the trained model, and call del on train_data, validation_data, train_loader, validation_loader, the model and the solver, followed by torch.cuda.empty_cache().
For the cross validation, I simply call the nested _train_runner function in a loop, once per fold.
Now… I don’t really understand where my "bleeding" between folds is coming from, since I reset the network parameters on every call of the nested function. Inside my solver's __init__ I also create my Adam optimizer - do I need to somehow reset this every iteration as well? Then again, shouldn't that already happen, given that I construct a new solver on each call? Apologies for the long post - it has been a long few days trying to solve this issue.
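Just so it is clear what I mean by resetting, here is a simplified sketch of what I believe should be happening on each fold (names here are illustrative only; my actual code follows below):

# Simplified sketch of the per-fold reset I think I am doing (illustrative
# only). The intent: each fold gets a brand-new model AND a brand-new Adam
# optimizer bound to that new model, so nothing can leak between folds.
import torch

for k in range(number_of_folds):          # number_of_folds = data_parameters['k_fold']
    model = UNet3D(network_parameters)    # fresh model instance
    model.reset_parameters()              # re-initialise all weights
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    # ... train for number_of_epochs and record the validation loss ...
    del model, optimizer
    torch.cuda.empty_cache()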
Here is a mock-up of my code. Some of the functions used are defined elsewhere, and in the interest of space I am leaving them out.
import os

import numpy as np
import torch
from torch.utils import data


def train(data_parameters, training_parameters, network_parameters, misc_parameters):
    """Training Function

    This function trains a given model using the provided training data.

    Args:
        data_parameters (dict): Dictionary containing relevant information for the data files.
        training_parameters (dict): Dictionary containing relevant hyperparameters for training the network.
        network_parameters (dict): Dictionary containing the relevant network architecture parameters.
        misc_parameters (dict): Dictionary of additional hyperparameters.
    """

    def _train_runner(data_parameters, training_parameters, network_parameters, misc_parameters):
        """Wrapper for the training operation"""
        train_data, validation_data = load_data(data_parameters)

        train_loader = data.DataLoader(
            dataset=train_data,
            batch_size=training_parameters['training_batch_size'],
            shuffle=True,
            num_workers=4,
            pin_memory=True
        )
        validation_loader = data.DataLoader(
            dataset=validation_data,
            batch_size=training_parameters['validation_batch_size'],
            shuffle=False,
            num_workers=4,
            pin_memory=True
        )

        if training_parameters['use_pre_trained']:
            model = torch.load(training_parameters['pre_trained_path'])
        else:
            model = UNet3D(network_parameters)

        # Re-initialise all weights so every call starts from scratch.
        model.reset_parameters()

        solver = Solver(model=model,
                        device=misc_parameters['device'],
                        number_of_classes=network_parameters['number_of_classes'],
                        experiment_name=training_parameters['experiment_name'],
                        optimizer_arguments={'lr': training_parameters['learning_rate'],
                                             'betas': training_parameters['optimizer_beta'],
                                             'eps': training_parameters['optimizer_epsilon'],
                                             'weight_decay': training_parameters['optimizer_weight_decay']
                                             },
                        model_name=misc_parameters['model_name'],
                        number_epochs=training_parameters['number_of_epochs'],
                        loss_log_period=training_parameters['loss_log_period'],
                        learning_rate_scheduler_step_size=training_parameters['learning_rate_scheduler_step_size'],
                        learning_rate_scheduler_gamma=training_parameters['learning_rate_scheduler_gamma'],
                        use_last_checkpoint=training_parameters['use_last_checkpoint'],
                        experiment_directory=misc_parameters['experiments_directory'],
                        logs_directory=misc_parameters['logs_directory'],
                        checkpoint_directory=misc_parameters['checkpoint_directory']
                        )

        validation_loss = solver.train(train_loader, validation_loader)

        model_output_path = os.path.join(
            misc_parameters['save_model_directory'],
            training_parameters['final_model_output_file'])
        create_folder(misc_parameters['save_model_directory'])
        torch.save(model, model_output_path)
        print("Final Model Saved in: {}".format(model_output_path))

        # Free everything before the next fold starts.
        del train_data, validation_data, train_loader, validation_loader, model, solver
        torch.cuda.empty_cache()

        return validation_loss

    if data_parameters['k_fold'] is None:
        _ = _train_runner(data_parameters, training_parameters,
                          network_parameters, misc_parameters)
    else:
        print("Training initiated using K-fold Cross Validation!")
        k_fold_losses = []
        # Keep the unmodified file name so the per-fold suffix does not
        # accumulate across iterations (model1.pth.tar -> model12.pth.tar, ...).
        base_model_output_file = training_parameters['final_model_output_file']
        for k in range(data_parameters['k_fold']):
            print("K-fold Number: {}".format(k + 1))
            data_parameters['train_list'] = os.path.join(
                data_parameters['data_folder_name'], 'train' + str(k + 1) + '.txt')
            data_parameters['validation_list'] = os.path.join(
                data_parameters['data_folder_name'], 'validation' + str(k + 1) + '.txt')
            training_parameters['final_model_output_file'] = base_model_output_file.replace(
                ".pth.tar", str(k + 1) + ".pth.tar")
            validation_loss = _train_runner(
                data_parameters, training_parameters, network_parameters, misc_parameters)
            k_fold_losses.append(validation_loss)

        for k in range(data_parameters['k_fold']):
            print("K-fold Number: {} Loss: {}".format(k + 1, k_fold_losses[k]))
        print("K-fold Cross Validation Average Loss: {}".format(np.mean(k_fold_losses)))
And here is my Solver __init__:
from torch.nn import MSELoss


class Solver():
    """Solver class for the BrainMapper U-net.

    This class contains the PyTorch implementation of the U-net solver required for the BrainMapper project.
    """

    def __init__(self,
                 model,
                 device,
                 number_of_classes,
                 experiment_name,
                 optimizer=torch.optim.Adam,
                 optimizer_arguments={},
                 loss_function=MSELoss(),
                 model_name='Model',
                 labels=None,
                 number_epochs=10,
                 loss_log_period=5,
                 learning_rate_scheduler_step_size=5,
                 learning_rate_scheduler_gamma=0.5,
                 use_last_checkpoint=True,
                 experiment_directory='experiments',
                 logs_directory='logs',
                 checkpoint_directory='checkpoints'
                 ):
        self.model = model
        self.device = device
        # A new optimizer is constructed on every Solver instantiation,
        # bound to the parameters of the model instance passed in.
        self.optimizer = optimizer(model.parameters(), **optimizer_arguments)
        # etc...
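For completeness, my .reset_parameters() is essentially the standard pattern from the other threads; here is a minimal sketch of what my version does:

import torch.nn as nn


class UNet3D(nn.Module):
    # ... layers defined in __init__ ...

    def reset_parameters(self):
        # Re-initialise every submodule that knows how to reset itself
        # (Conv3d, BatchNorm3d, Linear, ...). The `is not self` guard
        # avoids recursing back into this method.
        for module in self.modules():
            if module is not self and hasattr(module, 'reset_parameters'):
                module.reset_parameters()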