Improve the performance of my model

moreshud · March 21, 2021, 9:37pm

Hi everyone,

I am working with a 3D dataset on a 2D U-Net that is fed slice by slice. Having gathered insight from the forum on earlier challenges, I finally trained on some samples, and the results suggest that my model is overfitting. I would like your suggestions for improving the performance of my model.

[Figure: UNet pathologies loss plot, 2021-03-19]

KFrank · March 22, 2021, 2:50am

Hi Moreshud!

I wouldn’t say that your model is overfitting (yet).

It is true that your training loss is lower than your validation loss – and
is going down more rapidly. But to me overfitting means that your
validation loss starts going up (even while the training loss continues to
decrease). The idea is that your model is fitting the “random” specifics
of your training set so much that it degrades your model’s previous
ability to make good predictions on your validation set.

So keep training for more epochs as long as your validation loss is
still going down – even if it’s larger than your training loss.
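
As a rough illustration of that criterion, here is a minimal sketch that looks at a list of per-epoch validation losses and reports the epoch after which the loss stopped improving. The function name `overfitting_onset` and the `patience` and `min_delta` parameters are my own assumptions for the example, not something from your code.

```python
def overfitting_onset(val_losses, patience=3, min_delta=0.0):
    """Return the index of the best epoch once the validation loss has failed
    to improve for `patience` consecutive epochs, or None if it is still
    improving (i.e. no sign of overfitting yet by this criterion)."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss is still going down: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: treat best_epoch as the onset.
            return best_epoch
    return None


# Hypothetical loss histories, made up purely for illustration.
still_improving = [0.90, 0.70, 0.60, 0.55, 0.50]
turning_up = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]

print(overfitting_onset(still_improving))
print(overfitting_onset(turning_up))
```

Note that the gap between the training and validation curves plays no role here; only the trend of the validation loss does.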

Another observation of greater concern to me is that your training
loss starts out significantly larger than your validation loss. This is
something you should get to the bottom of. Is it just a statistical fluke?
You could rerun your training with different random partitions of your
data into the training and validation sets. Are you computing your
training and validation losses in slightly different ways so that you’re
not making a direct comparison? Is it an outright bug?
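
To check the "statistical fluke" possibility, something like the following sketch could generate several random partitions. It assumes each 3D volume has an identifier and that you split by volume rather than by slice, so slices from the same volume never land in both sets; the `case_XXX` identifiers and the `split_volumes` helper are hypothetical.

```python
import random


def split_volumes(volume_ids, valid_fraction=0.2, seed=0):
    """Randomly partition volume identifiers into (train_ids, valid_ids).

    Splitting at the volume level (an assumption about the setup) keeps all
    slices of one volume on the same side of the split."""
    ids = sorted(set(volume_ids))   # sort first so the result depends only on the seed
    rng = random.Random(seed)       # local RNG, does not touch global random state
    rng.shuffle(ids)
    n_valid = max(1, round(len(ids) * valid_fraction))
    return ids[n_valid:], ids[:n_valid]


# Hypothetical identifiers; replace with the names of your own volumes.
volume_ids = [f"case_{i:03d}" for i in range(20)]

for seed in range(3):
    train_ids, valid_ids = split_volumes(volume_ids, valid_fraction=0.2, seed=seed)
    print(seed, len(train_ids), len(valid_ids), valid_ids)
```

If the training loss starts above the validation loss for every seed, a fluke in one particular partition becomes a less likely explanation.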

Best.

K. Frank

aguennecjacq · March 22, 2021, 9:19am

+1 for what KFrank just said: you need to train a bit longer before your model actually overfits.
If you want a good starting point to improve your model, start by playing with your learning rate.

Related article: Understanding Learning Rates and How It Improves Performance in Deep Learning | by Hafidz Zulkifli | Towards Data Science
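
A minimal sketch of what "playing with the learning rate" could look like: try a handful of log-spaced values for a few epochs each and keep the one with the lowest validation loss. Here `short_run` is a hypothetical callable that you would write yourself (train briefly at the given learning rate, return the validation loss); the toy stand-in below exists only to make the example runnable.

```python
import math


def log_spaced(low, high, num):
    """Return `num` learning rates evenly spaced on a log10 scale."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (num - 1)
    return [10 ** (log_low + i * step) for i in range(num)]


def pick_learning_rate(candidates, short_run):
    """Run `short_run(lr)` for each candidate and return (best_lr, all_results).

    Runs that diverged (inf or nan loss) are ignored when picking the best."""
    results = {lr: short_run(lr) for lr in candidates}
    finite = {lr: loss for lr, loss in results.items() if math.isfinite(loss)}
    return min(finite, key=finite.get), results


def toy_short_run(lr):
    # Stand-in for "train a few epochs and return the validation loss".
    # Made up for illustration: it is smallest near lr = 1e-3.
    return (math.log10(lr) + 3) ** 2


candidates = log_spaced(1e-5, 1e-1, num=5)
best_lr, results = pick_learning_rate(candidates, toy_short_run)
print(best_lr)
```

Spacing the candidates by factors of ten is a common first pass; a finer sweep around the best value can follow if training is cheap enough.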

moreshud · March 23, 2021, 5:14am

Thanks for the explanation

I am training for more epochs, as you suggested, to improve the model performance.

Due to the nature of the dataset (given as compressed .nii.gz), I converted each volume to .npy in order to train on the volumetric data slice by slice. I then split the dataset into training, validation, and testing sets, taking into account that some of the given files have no pathologies, some have just one pathology, and some have both (with a bit of overlap). I am currently rerunning with a different training-to-validation split ratio.

This is also one of my worries on seeing the result, and I have checked what might be the cause. I compute the loss by resetting it to zero at the start of every epoch, following the example given on the PyTorch site, although that example is for a classification task.
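
A note on the comparison itself: when the training loss is accumulated batch by batch while the weights are still being updated (as in the loop posted below), it is an average over many intermediate versions of the model, whereas the validation loss is measured once with the end-of-epoch weights. Whether this explains the plot in this thread is an assumption, but the effect is easy to see in a toy problem. The sketch below fits a single weight with plain SGD and compares the running average against the loss re-measured on the same data after the epoch; all names and numbers are made up for illustration.

```python
def mean_squared_error(w, xs, ys):
    """Loss of the model y = w * x over the whole data set."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def one_epoch(w, lr, xs, ys):
    """Run one epoch of per-sample SGD.

    Returns (running_avg, end_of_epoch_loss, new_w), where running_avg is
    accumulated before each update, the way a typical training loop does it."""
    running = 0.0
    for x, y in zip(xs, ys):
        error = w * x - y
        running += error ** 2          # loss with the weights *before* this step
        w -= lr * 2 * error * x        # gradient step on this sample
    return running / len(xs), mean_squared_error(w, xs, ys), w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [2 * x for x in xs]               # data generated with a true weight of 2

running_avg, end_of_epoch, w = one_epoch(w=0.0, lr=0.05, xs=xs, ys=ys)
print(running_avg, end_of_epoch)
```

In the first epochs, when the model improves quickly, the running average is noticeably higher than the loss of the final weights on the very same data. For a like-for-like curve, one option is to re-evaluate the training set after each epoch in `model.eval()` mode with gradients disabled, exactly as is done for validation.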

    print('Training & Validation Started......')
    for epoch in epoch_ranger:
        start_time = time.time()
        
        epoch_loss = {"train":{"pathologies":0, "lunglobes":0}, 
                      "valid":{"pathologies":0, "lunglobes":0}}
        epoch_score = {"train":{"pathologies":0, "lunglobes":0},
                       "valid":{"pathologies":0, "lunglobes":0}}
        epoch_channel = {"train":{}, "valid":{}}
        
        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train() 
            else:
                model.eval() 

            running_pathologies_loss, running_lunglobes_loss = 0.0, 0.0
            running_pathologies_score, running_lunglobes_score  = 0.0, 0.0

            # Iterate over data.
            for data in dataloaders[phase]:                
                inputs, pathology_target = data

                inputs = inputs.to(device) 
                pathology_target = pathology_target.to(device)
                
                # zero the parameter gradients
                optimizer.zero_grad()
               
                # forward
                with torch.set_grad_enabled(phase == 'train'): 
                    outputs = model(inputs) 

                    pathology_loss = criterion["pathologies"](outputs, pathology_target)#.mean()  # [:, [0,1,2], :,:]
                    pathology_pred = torch.sigmoid(outputs) >= 0.5
                    
                    dice_coefficient_pathologies = evaluation_metric(pathology_pred, pathology_target)
                                       
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        pathology_loss.backward() #retain_graph=True 
                        optimizer.step()
                           
                # statistics
                running_pathologies_loss += pathology_loss.item() * inputs.size(0)
                
                running_pathologies_score += dice_coefficient_pathologies.item() * inputs.size(0)
            
                        
            epoch_loss[phase]["pathologies"] = running_pathologies_loss/len(dataloaders[phase].dataset)
            epoch_score[phase]["pathologies"] = running_pathologies_score / len(dataloaders[phase].dataset)
            
            # storing experiment result for visualization
            container[phase]["loss"]["pathologies"].append(epoch_loss[phase]["pathologies"])
            container[phase]["score"]["pathologies"].append(epoch_score[phase]["pathologies"])
                        
            if phase == "valid":
                container["learning_rate"].append([param_group['lr'] for param_group in optimizer.param_groups][0])         
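                # NOTE: "lunglobes" is never updated in this snippet (it stays 0), so this
                # average is simply half of the pathologies validation loss.
                average_val_loss = (epoch_loss["valid"]["pathologies"] + epoch_loss["valid"]["lunglobes"])/2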
                # scheduler.step(average_val_loss)    #ReduceOnPlateau                          
                # scheduler.step()                             
            
                # saving model checkpoint for inference and/or resuming training
                checkpoint = {
                    'epoch': epoch + 1,
                    'avg_min_val_loss': average_val_loss,
                    'model': model,
                    'model_state_dict': model.state_dict(),
                    'optimizer': optimizer,
                    'optimizer_state_dict': optimizer.state_dict(),
                    'container': container
                }
            
                save_checkpoint(checkpoint, False, checkpoint_path, best_model_path)
            
                if epoch != 0 and average_val_loss < avg_min_val_loss: # pathologies or lunglobes?
                    print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(avg_min_val_loss, average_val_loss))
                    avg_min_val_loss = average_val_loss
         
                    save_checkpoint(checkpoint, True, checkpoint_path, best_model_path)             

        training_time = str(datetime.timedelta(seconds=time.time() - start_time))[:7]
        
        print("Epoch: {}/{}".format(epoch+1, num_epochs),
              "Training | pathologies - loss: {:.4f}".format(epoch_loss["train"]["pathologies"]),
              "score: {:.4f}".format(epoch_score["train"]["pathologies"]),
              "Validation | pathologies - loss: {:.4f}".format(epoch_loss["valid"]["pathologies"]),
              "score: {:.4f}".format(epoch_score["valid"]["pathologies"]),
              "|Time: {}".format(training_time))

moreshud · March 23, 2021, 5:16am

I will try other values of my learning rate for a few epochs to see which gives a good starting point. Can I handle this better with learning rate scheduling?

aguennecjacq · March 23, 2021, 8:33am

I would recommend getting the starting learning rate right first, then doing some fine-tuning with a learning rate scheduler. For other parameters, I would suggest copying them from the professionals (check a few papers and GitHub repositories) when starting. After a while you will be able to roughly estimate what your parameters should be from experience.
Unfortunately, what I’m describing is the ideal situation. This depends a lot on what you are doing and your resources. If your model takes a long time to train or you don’t have the resources to train it quickly (contrary to popular belief, not every researcher has >100 GPUs), you may not be able to do a lot of fine-tuning of your hyperparameters.
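
To make the "scheduler after a good starting rate" idea concrete, here is a plain-Python sketch of reduce-on-plateau logic: keep the learning rate fixed while the validation loss improves, and shrink it once the loss has stalled for more than `patience` epochs. This is only an illustration of the idea with names I made up (`PlateauReducer`, `factor`, `patience`, `min_lr`), not PyTorch's implementation. In real code you would more likely use the built-in `torch.optim.lr_scheduler.ReduceLROnPlateau`, whose `step` takes the monitored validation loss, which is what the commented-out `scheduler.step(average_val_loss)` line in your loop appears to be for.

```python
class PlateauReducer:
    """Toy reduce-on-plateau schedule: multiply the learning rate by `factor`
    when the validation loss has not improved for more than `patience` epochs."""

    def __init__(self, lr, factor=0.1, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record this epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # Loss has plateaued: lower the rate, but never below min_lr.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
reducer = PlateauReducer(lr=1e-3, factor=0.5, patience=2)
lrs = [reducer.step(loss) for loss in val_losses]
print(lrs)
```

The starting value still matters: a schedule like this can only lower the rate, so it will not rescue a run whose initial learning rate is far too small.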