CNN overfitting

Hi everyone! I’m fairly new to deep learning and I’m trying to build an image classification model. There are currently four classes of military aircraft (F15, F16, F18 and F35; the dataset was downloaded from here: Military Aircraft Detection Dataset | Kaggle) with roughly 1300-1500 images each (I only took the ones from the ‘crop’ folder). I split each class’s images into an 80% training set, a 10% validation set and a 10% test set.
The only transforms I applied were Resize (to 128 × 128) and ToTensor. I’m using the following architecture for the model:

class AircraftVision(nn.Module):

    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        self.conv_block_1 = nn.Sequential(
            nn.Conv2d(in_channels=input_shape,
                      out_channels=hidden_units,
                      kernel_size=3,
                      padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.5)
        )
        self.conv_block_2 = nn.Sequential(
            nn.Conv2d(in_channels=hidden_units,
                      out_channels=output_shape,
                      kernel_size=3,
                      padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.5)
        )
        self.conv_block_3 = nn.Sequential(
            nn.Conv2d(in_channels=output_shape,
                      out_channels=128,
                      kernel_size=3,
                      padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.5)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # in_features assumes 128x128 inputs: three 2x2 max-pools give
            # 16x16 feature maps with 128 channels, and output_shape*32*16
            # equals 128*16*16 only when output_shape == 64
            nn.Linear(in_features=output_shape*32*16,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(in_features=256, out_features=len(class_names))
        )

    def forward(self, x):
        x = self.conv_block_1(x)
        x = self.conv_block_2(x)
        x = self.conv_block_3(x)
        x = self.classifier(x)
        return x

The loss function is CrossEntropyLoss and the optimizer is Adam (learning rate = 0.001, weight decay = 1e-5). The number of epochs is 30.
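
For reference, the rest of my setup looks roughly like this (the folder paths, batch size and channel counts are placeholders rather than my exact values):

import torch
from torch import nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Only Resize and ToTensor, as mentioned above
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

# "data/train" and "data/val" stand in for wherever the split folders live
train_ds = datasets.ImageFolder("data/train", transform=transform)
val_ds = datasets.ImageFolder("data/val", transform=transform)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
valid_dl = DataLoader(val_ds, batch_size=32, shuffle=False)

class_names = train_ds.classes  # the four aircraft classes

# output_shape=64 is what makes the classifier's output_shape*32*16
# match the actual flattened size (128*16*16) for 128x128 inputs
model = AircraftVision(input_shape=3, hidden_units=32, output_shape=64).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)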

After several attempts at training the model, I always end up with a pretty high training accuracy (~95%) and a low training loss. The validation loss falls slightly at first (with the accuracy rising), but usually around epoch 15 the validation accuracy starts to fluctuate around 60%-65%, while the validation loss rises slightly but steadily.

As you can see, I already tried adding dropout layers and weight decay. I’ve played around with the learning rate, weight decay, kernel_size and padding in the convolutional layers and a few other minor changes to the architecture.

I have several suspicions why the model is overfitting so much (in no particular order):

  1. There’s something wrong with my architecture.
  2. The data I’m using is not good enough, as some of the images are of pretty low quality.
  3. The training/validation loop I’m using (slightly modified from one of Sebastian Raschka’s books) is wrong. Here it is:

def train_val(model, num_epochs, train_dl, valid_dl):
    # loss_fn, optimizer and device come from the enclosing scope
    loss_hist_train = [0] * num_epochs
    accuracy_hist_train = [0] * num_epochs
    loss_hist_valid = [0] * num_epochs
    accuracy_hist_valid = [0] * num_epochs
    for epoch in range(num_epochs):
        model.train()
        for x_batch, y_batch in train_dl:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            loss_hist_train[epoch] += loss.item()*y_batch.size(0)
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            accuracy_hist_train[epoch] += is_correct.sum().item()
        loss_hist_train[epoch] /= len(train_dl.dataset)
        accuracy_hist_train[epoch] /= len(train_dl.dataset)

        model.eval()
        with torch.no_grad():
            for x_batch, y_batch in valid_dl:
                x_batch, y_batch = x_batch.to(device), y_batch.to(device)
                pred = model(x_batch)
                loss = loss_fn(pred, y_batch)
                loss_hist_valid[epoch] += loss.item()*y_batch.size(0)
                is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
                accuracy_hist_valid[epoch] += is_correct.sum().item()
        loss_hist_valid[epoch] /= len(valid_dl.dataset)
        accuracy_hist_valid[epoch] /= len(valid_dl.dataset)

        print(f"Epoch: {epoch+1} | Train loss: {loss_hist_train[epoch]:.3f} | Train accuracy: {accuracy_hist_train[epoch]:.3f} | Validation loss: {loss_hist_valid[epoch]:.3f} | Validation accuracy: {accuracy_hist_valid[epoch]:.3f}")
    return loss_hist_train, loss_hist_valid, accuracy_hist_train, accuracy_hist_valid

Sorry if this is a bit too much, but I’ve been working on this model for several days now and this overfitting is really starting to bug me. Does anyone have any advice on how to deal with this issue? Many thanks in advance!

Here are the results of my latest run. The only difference here is that I tried resizing the images to 224 × 224. As you can see, the model starts to overfit at epoch 10, after which the loss starts to rise and the accuracy starts to fluctuate:

Epoch: 1 | Train loss: 1.496 | Train accuracy: 0.260 | Validation loss: 1.383 | Validation accuracy: 0.261
Epoch: 2 | Train loss: 1.383 | Train accuracy: 0.273 | Validation loss: 1.382 | Validation accuracy: 0.271
Epoch: 3 | Train loss: 1.381 | Train accuracy: 0.282 | Validation loss: 1.383 | Validation accuracy: 0.271
Epoch: 4 | Train loss: 1.382 | Train accuracy: 0.285 | Validation loss: 1.379 | Validation accuracy: 0.288
Epoch: 5 | Train loss: 1.366 | Train accuracy: 0.296 | Validation loss: 1.345 | Validation accuracy: 0.346
Epoch: 6 | Train loss: 1.292 | Train accuracy: 0.387 | Validation loss: 1.296 | Validation accuracy: 0.414
Epoch: 7 | Train loss: 1.174 | Train accuracy: 0.470 | Validation loss: 1.249 | Validation accuracy: 0.447
Epoch: 8 | Train loss: 1.028 | Train accuracy: 0.551 | Validation loss: 1.194 | Validation accuracy: 0.495
Epoch: 9 | Train loss: 0.860 | Train accuracy: 0.640 | Validation loss: 1.165 | Validation accuracy: 0.534
Epoch: 10 | Train loss: 0.672 | Train accuracy: 0.720 | Validation loss: 1.143 | Validation accuracy: 0.565
Epoch: 11 | Train loss: 0.516 | Train accuracy: 0.797 | Validation loss: 1.276 | Validation accuracy: 0.530
Epoch: 12 | Train loss: 0.413 | Train accuracy: 0.832 | Validation loss: 1.488 | Validation accuracy: 0.549
Epoch: 13 | Train loss: 0.344 | Train accuracy: 0.866 | Validation loss: 1.365 | Validation accuracy: 0.563
Epoch: 14 | Train loss: 0.292 | Train accuracy: 0.886 | Validation loss: 1.491 | Validation accuracy: 0.567
Epoch: 15 | Train loss: 0.264 | Train accuracy: 0.895 | Validation loss: 1.463 | Validation accuracy: 0.586
Epoch: 16 | Train loss: 0.226 | Train accuracy: 0.915 | Validation loss: 1.733 | Validation accuracy: 0.602
Epoch: 17 | Train loss: 0.196 | Train accuracy: 0.927 | Validation loss: 1.537 | Validation accuracy: 0.576
Epoch: 18 | Train loss: 0.195 | Train accuracy: 0.924 | Validation loss: 1.714 | Validation accuracy: 0.567
Epoch: 19 | Train loss: 0.167 | Train accuracy: 0.933 | Validation loss: 1.625 | Validation accuracy: 0.578
Epoch: 20 | Train loss: 0.164 | Train accuracy: 0.938 | Validation loss: 1.844 | Validation accuracy: 0.561
Epoch: 21 | Train loss: 0.160 | Train accuracy: 0.938 | Validation loss: 1.963 | Validation accuracy: 0.557
Epoch: 22 | Train loss: 0.145 | Train accuracy: 0.945 | Validation loss: 1.734 | Validation accuracy: 0.600
Epoch: 23 | Train loss: 0.142 | Train accuracy: 0.943 | Validation loss: 1.819 | Validation accuracy: 0.596
Epoch: 24 | Train loss: 0.144 | Train accuracy: 0.946 | Validation loss: 2.218 | Validation accuracy: 0.586
Epoch: 25 | Train loss: 0.130 | Train accuracy: 0.951 | Validation loss: 1.947 | Validation accuracy: 0.574
Epoch: 26 | Train loss: 0.124 | Train accuracy: 0.953 | Validation loss: 2.069 | Validation accuracy: 0.596
Epoch: 27 | Train loss: 0.125 | Train accuracy: 0.956 | Validation loss: 1.830 | Validation accuracy: 0.578
Epoch: 28 | Train loss: 0.105 | Train accuracy: 0.963 | Validation loss: 2.143 | Validation accuracy: 0.607
Epoch: 29 | Train loss: 0.096 | Train accuracy: 0.964 | Validation loss: 1.934 | Validation accuracy: 0.596
Epoch: 30 | Train loss: 0.102 | Train accuracy: 0.965 | Validation loss: 2.452 | Validation accuracy: 0.605

  1. Would suggest using a pretrained ResNet-18. Given that it’s overfitting even on such a small network, that may be your best bet. Training from scratch isn’t really beneficial on smaller datasets. You could try MobileNet as well. On that note, resizing to 224 × 224 may be the smart move.

  2. You could try cross-validation. Maybe your split is just difficult. There’s a rough sketch of what I mean right after this list.
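
Something along these lines, assuming an ImageFolder-style dataset over all the images (full_ds here is hypothetical):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, Subset

# full_ds is a placeholder for an ImageFolder over ALL images (no manual split)
labels = np.array(full_ds.targets)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    train_dl = DataLoader(Subset(full_ds, train_idx), batch_size=32, shuffle=True)
    val_dl = DataLoader(Subset(full_ds, val_idx), batch_size=32, shuffle=False)
    # re-initialize the model here, then run your usual train/val loop per fold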

Thank you very much for your suggestions! I really appreciate it!

  1. The thing is, I really want to build something from scratch so that I can learn how to do things properly. You mentioned that my network/datasets are small, do you think that adding more data and/or layers to the model would help?

  2. Ah, that’s a great idea! I’ll definitely try some cross-validation. The split I did was admittedly lame: I simply took the second-to-last 10% of the data as the validation set and the last 10% as the test set, with no shuffling. I’ll redo it with a proper shuffled split, roughly as sketched below.
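
For the record, here’s roughly how I plan to redo the split, with random_split doing the shuffling (full_ds is a stand-in for a dataset over all the images, and the seed is just for reproducibility):

import torch
from torch.utils.data import random_split

n = len(full_ds)
n_val = n_test = n // 10
n_train = n - n_val - n_test

# shuffled 80/10/10 split instead of taking the last chunks of the data
train_ds, val_ds, test_ds = random_split(
    full_ds, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42)  # reproducible shuffle
)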

Does anything else come to mind? Is my training/validation loop ok?

Once again, thanks a lot!

  1. BatchNorm2d after your Conv2d layers may help (see the sketch after this list).
  2. p = 0.5 is pretty high for dropout.
  3. What is the ratio of classes in your train dataset?
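
For example, one way such a conv block could look (the channel counts and dropout rate here are just examples):

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),  # normalize activations; often stabilizes and speeds up training
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Dropout(p=0.3),   # milder than 0.5; 0.2-0.3 is a common range for conv layers
)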

Thanks a lot! I’ll try BatchNorm2d and lower the dropout rate; 0.3 should be OK, I assume?

I’m not at home right now, but the ratio is something like 1100-1200 images for two of the four classes and around 1200-1300 images for the other two. Not quite balanced, but still better than nothing.

You can try building ResNet-18 from scratch then :slight_smile: My point was that training from scratch doesn’t work too well with less data. Something pretrained would do much better!

I brought up the small network because your model already seems to overfit even at that size. More data would definitely help.

The basic idea of the train/val loop seems fine.

I too agree batchnorm is a good idea.

If you really want to stick with this architecture, you could try running a hyperparameter grid search. But I’d still generate a baseline with a pretrained model to understand what sort of performance you need to improve on (assuming it does improve). Something like the snippet below is enough to set one up.
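
(A minimal sketch; the weights enum needs a reasonably recent torchvision, older versions use pretrained=True instead.)

import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-18 with the final layer swapped for your 4 classes
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 4)

# Optionally freeze the backbone at first and train only the new head
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

Your 224 × 224 resize also fits what these pretrained weights expect.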

A couple more considerations/things you could try:

  1. The SGD optimizer tends to work better for generalization (which is what you’re looking for here). If compute isn’t an issue, it wouldn’t hurt to try. See here and here.
  2. If the dataset is unbalanced, you could pass the weight argument to the CrossEntropyLoss function.
  3. Are you using any data augmentation in your dataloader? If not, I suggest adding a lot of augmentation, given your small dataset. See here and here. All three ideas are sketched right after this list.
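
Rough sketches of all three (the learning rate, class counts and augmentation choices are placeholders you’d adapt to your data):

import torch
from torch import nn
from torchvision import transforms

# 1. SGD with momentum instead of Adam (lr here is a starting point, not tuned)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

# 2. Class-weighted cross-entropy: weight each class inversely to its frequency
#    (the counts are placeholders for your real per-class image counts)
counts = torch.tensor([1100.0, 1150.0, 1250.0, 1300.0])
class_weights = counts.sum() / (len(counts) * counts)
loss_fn = nn.CrossEntropyLoss(weight=class_weights.to(device))

# 3. Augmentation in the training transform only (keep val/test at Resize + ToTensor)
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])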

Right, I see what you mean now. I guess if all else fails, I could look up the architecture of Resnet-18 and try to build it from scratch :slight_smile:
I’ll also try to look for some more data in the meantime. Thanks for your help!

I’ll definitely try the SGD optimizer, I just didn’t know that it was better for generalization.
As I said, the only transforms I used were Resize and ToTensor, but I’ll take a look at all the links you sent and try to implement some augmentation. I guess this is a case of “How much regularization do you want? - Yes.” :slight_smile:

I’ve seen some references already to data augmentation (which you should definitely be using), but I suggest checking out Albumentations. It’s a library designed to help you perform a vast number of different augmentations.
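
A minimal pipeline might look like this (the specific augmentations are just examples; note that Albumentations works on NumPy arrays rather than PIL images):

import numpy as np
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.Resize(224, 224),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.Normalize(),  # defaults to ImageNet mean/std
    ToTensorV2(),
])

# usage: augmented = train_transform(image=np.array(pil_image))["image"]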

Also, what is the point of this project? To learn how to debug some of the issues of your DL model, or are you going for peak performance on a particular use case that has to do with planes? If it’s the latter, I suggest fine-tuning an existing model. Fine-tuning is probably the most useful thing in all of deep learning for anyone who wants a SoTA model at next to zero cost.

Welcome to the world of Deep Learning!

Thanks a lot, I’ll look into Albumentations and some other data augmentation techniques!

The purpose of this project is indeed to learn how to build and train a model from scratch. I’ve already learned so much during the process and I feel like I’ll learn a lot more when I implement all the suggestions in this thread. Plus, I really want to unlock the “Build a CNN From Scratch” achievement :slight_smile:


I think that’s a great idea. I’d highly suggest building a regular NN from scratch first if you haven’t already. A CNN adds feature maps and convolutional kernels, and updating those kernels through backprop is a LOT to do from scratch.