VGG16 using CIFAR10 not converging

I’m training a VGG16 model from scratch on the CIFAR10 dataset. The validation loss diverges from the training loss almost from the start of training.

I have tried the Adam optimizer as well as the SGD optimizer. I cannot figure out what I am doing incorrectly. Please point me in the right direction.

# Importing Dependencies

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.datasets import CIFAR10
from torchvision import transforms
from torch.utils.data import DataLoader
from tqdm import tqdm
from datetime import datetime

# Defining model
# VGG16 configuration: integers are conv output channels, 'M' marks a 2x2 max-pool
arch = [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']

class VGGNet(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.in_channels = in_channels
        self.conv_layers = self.create_conv_layers(arch)
        self.fcs = nn.Sequential(
            nn.Linear(in_features=512*1*1, out_features=4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(in_features=4096, out_features=4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        # print(x.shape)
        x = x.reshape(x.shape[0], -1)
        x = self.fcs(x)
        return x

    def create_conv_layers(self, arch):
        layers = []
        in_channels = self.in_channels

        for x in arch:
            if type(x) == int:
                out_channels = x
                layers += [nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                                     kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
                           nn.BatchNorm2d(x),
                           nn.ReLU()]
                in_channels = x
            elif x == 'M':
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

        return nn.Sequential(*layers)
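
# Quick shape check (illustrative only, not part of the training script): with
# 32x32 CIFAR10 inputs, the five MaxPool2d layers halve the spatial size five
# times (32 -> 16 -> 8 -> 4 -> 2 -> 1), so conv_layers outputs 512x1x1, which
# matches in_features=512*1*1 in the classifier above. For example:
#   m = VGGNet(in_channels=3, num_classes=10)
#   print(m(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])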

# Hyperparameters and settings

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
TRAIN_BATCH_SIZE = 64
VAL_BATCH_SIZE = 16
EPOCHS = 50

train_data = CIFAR10(root=".", train=True, 
                    transform=transforms.Compose([transforms.ToTensor()]), download=True)

# print(len(train_data))
val_data = CIFAR10(root=".", train=False,
                    transform=transforms.Compose([transforms.ToTensor()]), download=True)
# print(len(val_data))


train_loader = DataLoader(train_data, batch_size=TRAIN_BATCH_SIZE, shuffle=True, num_workers=8)
val_loader = DataLoader(val_data, batch_size=VAL_BATCH_SIZE, shuffle=True, num_workers=8)
# print(len(train_loader))
# print(len(val_loader))


num_train_batches = int(len(train_data)/TRAIN_BATCH_SIZE) 
num_val_batches = int(len(val_data)/VAL_BATCH_SIZE)

# Training and Val Loop

model = VGGNet(3, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, verbose=True)

# save_path = os.path.join(r"trained_models", f'{datetime.now().strftime("%m%d_%H%M%S")}.pth')

def train_val():
    
    for epoch in range(1, EPOCHS+1):
        print(f"Epoch: {epoch}/20")
        

        model.train()
        total_loss = 0
        for data in train_loader:
            image, target = data[0], data[1]
            image, target = image.to(device), target.to(device) 
            optimizer.zero_grad()
            output = model(image)
            loss = criterion(output, target) 
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
        print(f"Loss : {total_loss / num_train_batches}")
        
        save_path = os.path.join(r"trained_models", f'{datetime.now().strftime("%m%d_%H%M%S")}_{epoch}.pth')

        if epoch % 5 == 0:
            torch.save(model.state_dict(), save_path)

        with torch.no_grad():
            model.eval()
            total_val_loss = 0
            for data in val_loader:
                image, target = data[0], data[1]
                image, target = image.to(device), target.to(device) 
                output = model(image)
                val_loss = criterion(output, target)
                total_val_loss += val_loss

            total_val_loss = total_val_loss/num_val_batches
            print(f"Val Loss: {total_val_loss}")

            scheduler.step(total_val_loss)

Output is:

Epoch: 1/20 Loss : 1.3286100650795292 Val Loss: 1.3787670135498047
Epoch: 2/20 Loss : 0.822020811685832 Val Loss: 0.948610246181488
Epoch: 3/20 Loss : 0.6018326392113476 Val Loss: 0.9581698775291443
Epoch: 4/20 Loss : 0.47134833609764004 Val Loss: 1.2446043491363525
Epoch: 5/20 Loss : 0.35625831704114524 Val Loss: 0.8038020730018616
Epoch: 6/20 Loss : 0.27602518926566605 Val Loss: 0.6090452075004578
Epoch: 7/20 Loss : 0.21279048924686128 Val Loss: 0.6626076102256775
Epoch: 8/20 Loss : 0.16782210255280214 Val Loss: 0.6386368870735168
Epoch: 9/20 Loss : 0.12904227719518205 Val Loss: 0.8135524988174438
Epoch: 10/20 Loss : 0.10961572862077902 Val Loss: 0.727300226688385
Epoch: 11/20 Loss : 0.08377284912137456 Val Loss: 0.7346469163894653
Epoch: 12/20 Loss : 0.07044737199237916 Val Loss: 0.8241418600082397
Epoch: 13/20 Loss : 0.06040401630707726 Val Loss: 0.8411757349967957
Epoch: 14/20 Loss : 0.05157513573171604 Val Loss: 0.9980310201644897
Epoch: 15/20 Loss : 0.04703645325243019 Val Loss: 0.7441162467002869
Epoch: 16/20 Loss : 0.039386494244257594 Val Loss: 0.7185537219047546
Epoch: 17/20 Loss : 0.0361507039006692 Val Loss: 0.7251362800598145
Epoch    17: reducing learning rate of group 0 to 1.0000e-03.
Epoch: 18/20 Loss : 0.010131187833331622 Val Loss: 0.6911444067955017
Epoch: 19/20 Loss : 0.004273188020082817 Val Loss: 0.6758599877357483
Epoch: 20/20 Loss : 0.0023282255553611917 Val Loss: 0.6790934801101685

I trained it for 50 epochs, but the val_loss stayed at roughly the same values.

  1. Is there anything wrong with my model?
  2. Is there anything wrong in my training or validation loop?
  3. What can I try to make it converge?

Thank You


The problem is overfitting. This happens when your model essentially memorizes the training dataset but does poorly on the validation or test sets: because it has learned the exact training examples, it cannot generalize. Here is a good article on it. To fix it, you could add more data augmentation transforms from torchvision. I would recommend random rotation, color jitter, and random resized crop, but you can experiment with them all to see which work best.
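
For example, a training-set pipeline along these lines (just an illustrative sketch, using the transforms/CIFAR10 imports already in your script; the exact parameter values are hypothetical and worth tuning):

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),  # random crops that keep most of the image
    transforms.RandomRotation(15),                        # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

train_data = CIFAR10(root=".", train=True, transform=train_transform, download=True)
# keep the validation set un-augmented
val_data = CIFAR10(root=".", train=False, transform=transforms.ToTensor(), download=True)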


I added a momentum of 0.9 to your SGD and it seemed to fix the overfitting problem.

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)



Thank you for pointing out the overfitting issue; I wasn’t sure how to tackle it. The change suggested by @patrickwilliams3 definitely helped the val_loss converge better. However, my model is still overfitting, so I’ll try out the transformations you suggested.

Thank you. Adding momentum definitely helped with the issue. Would you suggest always using momentum with the optimizer, or is it a matter of trial and error to find out what works?

Could you please take a look at this training log and comment on whether it looks “normal”?

Epoch: 1/50	Training Loss : 1.5620558992828555	Val Loss: 1.3321219682693481	Val Acc: 0.625
Epoch: 2/50	Training Loss : 1.0371836209312424	Val Loss: 0.9880975484848022	Val Acc: 0.625
Epoch: 3/50	Training Loss : 0.8290961132101391	Val Loss: 0.8982344269752502	Val Acc: 0.625
Epoch: 4/50	Training Loss : 0.7013175389193513	Val Loss: 0.6085207462310791	Val Acc: 0.6875
Epoch: 5/50	Training Loss : 0.6180899054040689	Val Loss: 0.5821341872215271	Val Acc: 0.75
Epoch: 6/50	Training Loss : 0.549087760012473	Val Loss: 0.5268315672874451	Val Acc: 0.875
Epoch: 7/50	Training Loss : 0.49982954312087324	Val Loss: 0.5468194484710693	Val Acc: 0.875
Epoch: 8/50	Training Loss : 0.4651579355530422	Val Loss: 0.4912704527378082	Val Acc: 0.625
Epoch: 9/50	Training Loss : 0.4303335145382625	Val Loss: 0.4511815309524536	Val Acc: 0.6875
Epoch: 10/50	Training Loss : 0.3984605476946172	Val Loss: 0.44223153591156006	Val Acc: 1.0
Epoch: 11/50	Training Loss : 0.36902758542000486	Val Loss: 0.388616144657135	Val Acc: 0.75
Epoch: 12/50	Training Loss : 0.34863812268694955	Val Loss: 0.44266951084136963	Val Acc: 0.9375
Epoch: 13/50	Training Loss : 0.324177663354084	Val Loss: 0.3801344037055969	Val Acc: 0.9375
Epoch: 14/50	Training Loss : 0.30947255454671657	Val Loss: 0.43314695358276367	Val Acc: 0.8125
Epoch: 15/50	Training Loss : 0.2912858441338667	Val Loss: 0.4075799584388733	Val Acc: 0.9375
Epoch: 16/50	Training Loss : 0.27236257262451724	Val Loss: 0.37279966473579407	Val Acc: 0.9375
Epoch: 17/50	Training Loss : 0.25406273975110877	Val Loss: 0.359770804643631	Val Acc: 0.8125
Epoch: 18/50	Training Loss : 0.24467512166789732	Val Loss: 0.34812307357788086	Val Acc: 0.875
Epoch: 19/50	Training Loss : 0.22752507899404334	Val Loss: 0.3812360167503357	Val Acc: 0.9375
Epoch: 20/50	Training Loss : 0.22035224472775178	Val Loss: 0.35899537801742554	Val Acc: 0.875
Epoch: 21/50	Training Loss : 0.20604683225378967	Val Loss: 0.3145991563796997	Val Acc: 0.8125
Epoch: 22/50	Training Loss : 0.19765411956650217	Val Loss: 0.37555989623069763	Val Acc: 0.9375
Epoch: 23/50	Training Loss : 0.18564987197146773	Val Loss: 0.3490401804447174	Val Acc: 0.8125
Epoch: 24/50	Training Loss : 0.1768726637143918	Val Loss: 0.36409130692481995	Val Acc: 0.875
Epoch: 25/50	Training Loss : 0.1659085816439346	Val Loss: 0.35086601972579956	Val Acc: 0.875
Epoch: 26/50	Training Loss : 0.16131601035070922	Val Loss: 0.38393330574035645	Val Acc: 0.875
Epoch: 27/50	Training Loss : 0.15906610154925518	Val Loss: 0.34360015392303467	Val Acc: 0.9375
Epoch: 28/50	Training Loss : 0.1460178377835647	Val Loss: 0.33205556869506836	Val Acc: 0.75
Epoch: 29/50	Training Loss : 0.14060730992785425	Val Loss: 0.31575024127960205	Val Acc: 1.0
Epoch: 30/50	Training Loss : 0.1338219812837765	Val Loss: 0.3339424133300781	Val Acc: 0.875
Epoch: 31/50	Training Loss : 0.1271845905748589	Val Loss: 0.32900387048721313	Val Acc: 0.75
Epoch: 32/50	Training Loss : 0.11935397288035554	Val Loss: 0.34189411997795105	Val Acc: 0.875
Epoch    32: reducing learning rate of group 0 to 1.0000e-03.
Epoch: 33/50	Training Loss : 0.07561924026223837	Val Loss: 0.29794567823410034	Val Acc: 1.0
Epoch: 34/50	Training Loss : 0.05703055260576727	Val Loss: 0.3084920644760132	Val Acc: 0.875
Epoch: 35/50	Training Loss : 0.04979715111610049	Val Loss: 0.3127608597278595	Val Acc: 1.0
Epoch: 36/50	Training Loss : 0.045420369502398736	Val Loss: 0.31517815589904785	Val Acc: 0.9375
Epoch: 37/50	Training Loss : 0.04435844308469067	Val Loss: 0.31868359446525574	Val Acc: 0.9375
Epoch: 38/50	Training Loss : 0.040347291629307704	Val Loss: 0.32625502347946167	Val Acc: 0.8125
Epoch: 39/50	Training Loss : 0.03918177410646084	Val Loss: 0.3302097022533417	Val Acc: 1.0
Epoch: 40/50	Training Loss : 0.037286260129173486	Val Loss: 0.3412473797798157	Val Acc: 1.0
Epoch: 41/50	Training Loss : 0.03364361599659371	Val Loss: 0.3338499069213867	Val Acc: 0.6875
Epoch: 42/50	Training Loss : 0.03273921512024742	Val Loss: 0.3395068645477295	Val Acc: 0.9375
Epoch: 43/50	Training Loss : 0.03200470064675151	Val Loss: 0.3371736407279968	Val Acc: 1.0
Epoch: 44/50	Training Loss : 0.032651439529966894	Val Loss: 0.3381207287311554	Val Acc: 0.9375
Epoch    44: reducing learning rate of group 0 to 1.0000e-04.
Epoch: 45/50	Training Loss : 0.028502449316418935	Val Loss: 0.3405081033706665	Val Acc: 0.9375
Epoch: 46/50	Training Loss : 0.02839259219365111	Val Loss: 0.3350020945072174	Val Acc: 0.875
Epoch: 47/50	Training Loss : 0.02704484095001835	Val Loss: 0.3337722718715668	Val Acc: 1.0
Epoch: 48/50	Training Loss : 0.027093719671499115	Val Loss: 0.3422480821609497	Val Acc: 0.9375
Epoch: 49/50	Training Loss : 0.02701874623452361	Val Loss: 0.3457319140434265	Val Acc: 0.875
Epoch: 50/50	Training Loss : 0.02580867528640296	Val Loss: 0.34399473667144775	Val Acc: 0.9375

Questions:

  1. Is it normal to see validation accuracy reach 1?
  2. Are the fluctuations normal? Sometimes the val_loss decreases but the val_accuracy also decreases. I am under the impression that a decrease in loss means an increase in accuracy. Is that correct?

Thank You

Is it possible your validation accuracy is being computed for a single batch instead of the entire validation set? If so, the fluctuation would be perfectly normal, since an accuracy based on only 16 predictions will fluctuate heavily. (Notice that the reported values such as 0.625, 0.6875, and 0.9375 are all multiples of 1/16, which points to this.)

Otherwise, such heavy fluctuations would not make sense across a larger sample, especially as the training and validation losses steadily decline.

You are right; I was not reporting the accuracy correctly. I referred to this tutorial to fix my training loop. It now looks like this:

def train_val():
    
    for epoch in range(1, EPOCHS+1):
        print(f"Epoch: {epoch}/{EPOCHS}", end='\t')
        model.train()
        
        running_loss = 0
        total = 0
        correct = 0
        for data in train_loader:
            image, target = data[0], data[1]
            image, target = image.to(device), target.to(device) 
            optimizer.zero_grad()
            output = model(image)
            loss = criterion(output, target) 
            running_loss += loss.item()
            
            _, pred = torch.max(output, dim=1)
            total += target.size(0)
            correct += torch.sum(pred == target).item()
            
            loss.backward()
            optimizer.step()
        print(f"Training Loss: {running_loss/len(train_loader):.3f}\tTraining Acc: {correct/total}", end='\t')
        
        save_path = os.path.join(r"trained_models", f'{datetime.now().strftime("%m%d_%H%M%S")}_{epoch}.pth')

        if epoch % 5 == 0:
            torch.save(model.state_dict(), save_path)

        with torch.no_grad():
            
            model.eval()
            running_val_loss = 0
            total = 0
            correct = 0
            for data in val_loader:
                image, target = data[0], data[1]
                image, target = image.to(device), target.to(device) 
                output = model(image)
                val_loss = criterion(output, target)
                running_val_loss += val_loss
                _, pred = torch.max(output, dim=1)
                correct += torch.sum(pred == target).item()
                total += target.size(0)
            running_val_loss = running_val_loss/len(val_loader)
            print(f"Val Loss: {running_val_loss:.3f}\tVal Acc: {correct/total}")
            
            scheduler.step(running_val_loss)

Does it look okay to you? In particular, how I am calculating the losses and accuracies.

The training log also makes more sense to me now. Here it is.

Epoch: 1/50	Training Loss: 1.517	Training Acc: 0.43768	Val Loss: 1.131	Val Acc: 0.5953
Epoch: 2/50	Training Loss: 1.006	Training Acc: 0.65064	Val Loss: 0.939	Val Acc: 0.6759
Epoch: 3/50	Training Loss: 0.795	Training Acc: 0.73206	Val Loss: 0.726	Val Acc: 0.7514
Epoch: 4/50	Training Loss: 0.684	Training Acc: 0.77056	Val Loss: 0.610	Val Acc: 0.7947
Epoch: 5/50	Training Loss: 0.602	Training Acc: 0.79898	Val Loss: 0.608	Val Acc: 0.7956
Epoch: 6/50	Training Loss: 0.548	Training Acc: 0.81684	Val Loss: 0.532	Val Acc: 0.8201
Epoch: 7/50	Training Loss: 0.497	Training Acc: 0.83404	Val Loss: 0.516	Val Acc: 0.8304
Epoch: 8/50	Training Loss: 0.458	Training Acc: 0.84652	Val Loss: 0.503	Val Acc: 0.8312
Epoch: 9/50	Training Loss: 0.419	Training Acc: 0.85984	Val Loss: 0.467	Val Acc: 0.8426
Epoch: 10/50	Training Loss: 0.393	Training Acc: 0.87026	Val Loss: 0.450	Val Acc: 0.8527
Epoch: 11/50	Training Loss: 0.368	Training Acc: 0.87588	Val Loss: 0.414	Val Acc: 0.8645
Epoch: 12/50	Training Loss: 0.343	Training Acc: 0.88578	Val Loss: 0.388	Val Acc: 0.8681
Epoch: 13/50	Training Loss: 0.325	Training Acc: 0.893	Val Loss: 0.397	Val Acc: 0.867
Epoch: 14/50	Training Loss: 0.298	Training Acc: 0.89908	Val Loss: 0.380	Val Acc: 0.8731
Epoch: 15/50	Training Loss: 0.284	Training Acc: 0.90476	Val Loss: 0.355	Val Acc: 0.8803
Epoch: 16/50	Training Loss: 0.268	Training Acc: 0.91034	Val Loss: 0.370	Val Acc: 0.8766
Epoch: 17/50	Training Loss: 0.252	Training Acc: 0.9159	Val Loss: 0.353	Val Acc: 0.8792
Epoch: 18/50	Training Loss: 0.243	Training Acc: 0.91876	Val Loss: 0.368	Val Acc: 0.8841
Epoch: 19/50	Training Loss: 0.227	Training Acc: 0.92448	Val Loss: 0.366	Val Acc: 0.8833
Epoch: 20/50	Training Loss: 0.215	Training Acc: 0.92824	Val Loss: 0.329	Val Acc: 0.8939
Epoch: 21/50	Training Loss: 0.207	Training Acc: 0.93096	Val Loss: 0.342	Val Acc: 0.8909
Epoch: 22/50	Training Loss: 0.195	Training Acc: 0.93432	Val Loss: 0.352	Val Acc: 0.8924
Epoch: 23/50	Training Loss: 0.184	Training Acc: 0.93768	Val Loss: 0.353	Val Acc: 0.8857
Epoch: 24/50	Training Loss: 0.176	Training Acc: 0.94022	Val Loss: 0.370	Val Acc: 0.8905
Epoch: 25/50	Training Loss: 0.168	Training Acc: 0.94342	Val Loss: 0.343	Val Acc: 0.8927
Epoch: 26/50	Training Loss: 0.161	Training Acc: 0.94654	Val Loss: 0.334	Val Acc: 0.8984
Epoch: 27/50	Training Loss: 0.152	Training Acc: 0.94792	Val Loss: 0.358	Val Acc: 0.8945
Epoch: 28/50	Training Loss: 0.144	Training Acc: 0.95132	Val Loss: 0.316	Val Acc: 0.9014
Epoch: 29/50	Training Loss: 0.133	Training Acc: 0.95486	Val Loss: 0.348	Val Acc: 0.8997
Epoch: 30/50	Training Loss: 0.128	Training Acc: 0.95614	Val Loss: 0.363	Val Acc: 0.8961
Epoch: 31/50	Training Loss: 0.122	Training Acc: 0.9584	Val Loss: 0.326	Val Acc: 0.9001
Epoch: 32/50	Training Loss: 0.118	Training Acc: 0.9597	Val Loss: 0.326	Val Acc: 0.902
Epoch: 33/50	Training Loss: 0.119	Training Acc: 0.95974	Val Loss: 0.358	Val Acc: 0.8995
Epoch: 34/50	Training Loss: 0.109	Training Acc: 0.96328	Val Loss: 0.315	Val Acc: 0.9069
Epoch: 35/50	Training Loss: 0.103	Training Acc: 0.96464	Val Loss: 0.358	Val Acc: 0.9015
Epoch: 36/50	Training Loss: 0.095	Training Acc: 0.9675	Val Loss: 0.348	Val Acc: 0.9044
Epoch: 37/50	Training Loss: 0.094	Training Acc: 0.9682	Val Loss: 0.326	Val Acc: 0.907
Epoch: 38/50	Training Loss: 0.087	Training Acc: 0.97094	Val Loss: 0.341	Val Acc: 0.9067
Epoch: 39/50	Training Loss: 0.090	Training Acc: 0.96958	Val Loss: 0.331	Val Acc: 0.9065
Epoch: 40/50	Training Loss: 0.084	Training Acc: 0.97034	Val Loss: 0.347	Val Acc: 0.9108
Epoch: 41/50	Training Loss: 0.081	Training Acc: 0.97124	Val Loss: 0.340	Val Acc: 0.903
Epoch: 42/50	Training Loss: 0.075	Training Acc: 0.97416	Val Loss: 0.325	Val Acc: 0.9101
Epoch: 43/50	Training Loss: 0.073	Training Acc: 0.97534	Val Loss: 0.340	Val Acc: 0.9071
Epoch: 44/50	Training Loss: 0.071	Training Acc: 0.97654	Val Loss: 0.344	Val Acc: 0.91
Epoch: 45/50	Training Loss: 0.068	Training Acc: 0.977	Val Loss: 0.340	Val Acc: 0.9073
Epoch    45: reducing learning rate of group 0 to 1.0000e-03.
Epoch: 46/50	Training Loss: 0.038	Training Acc: 0.98764	Val Loss: 0.305	Val Acc: 0.9206
Epoch: 47/50	Training Loss: 0.028	Training Acc: 0.99118	Val Loss: 0.318	Val Acc: 0.9221
Epoch: 48/50	Training Loss: 0.024	Training Acc: 0.99198	Val Loss: 0.328	Val Acc: 0.9231
Epoch: 49/50	Training Loss: 0.020	Training Acc: 0.99322	Val Loss: 0.333	Val Acc: 0.9235
Epoch: 50/50	Training Loss: 0.019	Training Acc: 0.99384	Val Loss: 0.340	Val Acc: 0.9233

Does this look okay?

Also, I have a couple of doubts.
The validation loss decreases only minimally after epoch 12, while the validation accuracy keeps improving until the end. But then again, the model overfits by the time training ends. Around which epoch should training have stopped? Should I have used early stopping (something like the sketch below), or is there some other approach?
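
For example, a minimal early-stopping sketch based on tracking the best validation loss (the patience value is just a hypothetical number, and running_val_loss is the per-epoch value from my loop above):

best_val_loss = float('inf')
epochs_without_improvement = 0
patience = 7  # hypothetical; epochs to wait without val loss improvement

for epoch in range(1, EPOCHS + 1):
    # ... training and validation as in train_val(), producing running_val_loss ...
    if running_val_loss < best_val_loss:
        best_val_loss = running_val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), save_path)  # keep only the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break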

Thank you for being patient with these doubts.

Can someone please look at this and answer my queries?

Thanks

Your code seems fine. Yes, the model is overfitting, but at least the test accuracy is decent as well. To improve it, you probably just want to keep experimenting with your data transforms. Try different ones out and see if accuracy improves.

Thank you @Dwight_Foster for the review. I’ll try more transformations as you suggested.

I have one query. You are training the model with an input image size of 32x32, right? Have you tried resizing the images to (224, 224)?
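
For example, something along these lines (just a sketch; note that with 224x224 inputs the conv stack would output 512x7x7, so the first linear layer would need in_features=512*7*7, as in the original VGG16):

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # upscale CIFAR10's 32x32 images to the standard VGG input size
    transforms.ToTensor(),
])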