VGG-16 training time on Google Colab

Hi,

I’m using Google Colab with an NVIDIA Tesla P100 with 16 GB of GPU memory. I’m using VGG-16 without batch norm. I froze all layers except the first one, which I use to go from 1 to 3 channels, and the classifier layers. Here is a snippet from my code:

assert self.image_size == 224, "ERROR: Wrong image size."

model = (torchvision.models.vgg16(pretrained=True) if self.model_type == 'vgg-16'
         else torchvision.models.vgg19(pretrained=True))

if self.input_ch != 3:
    first_conv_layer = [nn.Conv2d(self.input_ch, 3, kernel_size=3, stride=1,
                                  padding=1, dilation=1, groups=1, bias=True)]
    first_conv_layer.extend(list(model.features))
    model.features = nn.Sequential(*first_conv_layer)

model.classifier[-1] = nn.Linear(4096, 1000)
model.classifier.add_module('7', nn.ReLU())
model.classifier.add_module('8', nn.Dropout(p=0.5, inplace=False))
model.classifier.add_module('9', nn.Linear(1000, self.output_ch))
model.classifier.add_module('10', nn.LogSoftmax(dim=1))

for param in model.features[1:].parameters():  # disable grad for trained layers
    param.requires_grad = False

I trained it on the FashionMNIST dataset, which has num_channels=1; that’s why I add an extra layer at the beginning. I added nn.LogSoftmax(dim=1) as the last layer because I’m using nn.NLLLoss().
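
As a side note, nn.LogSoftmax followed by nn.NLLLoss gives the same result as nn.CrossEntropyLoss applied to the raw logits; here is a minimal sketch to check this (the shapes are arbitrary):

import torch
import torch.nn as nn

logits = torch.randn(8, 10)             # arbitrary batch of 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))

nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))          # expected: True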

My question is: with 50k training images and 10k validation images, learning_rate=0.0001, batch_size=64, and the Adam optimizer, it took about 3.5 hours for 20 epochs. Is that normal?

~10 minutes per epoch might be reasonable for this model on your setup.
I tried it on my machine with a Titan V and got ~4 minutes per epoch:

import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision.models as models
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# setup
dataset = datasets.FashionMNIST(
    root='/home/pbialecki/python/data',
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor()]))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4)

device = 'cuda'
model = models.vgg16(pretrained=True)
# replace the first conv layer to accept the single-channel FashionMNIST input
model.features[0] = nn.Conv2d(1, 64, 3, 1, 1)
# extend the classifier with an additional head for the 10 FashionMNIST classes
model.classifier[-1] = nn.Linear(4096, 1000)
model.classifier.add_module('7', nn.ReLU())
model.classifier.add_module('8', nn.Dropout(p=0.5, inplace=False))
model.classifier.add_module('9', nn.Linear(1000, 10))
model.classifier.add_module('10', nn.LogSoftmax(dim=1))

# freeze all pretrained feature layers except the new first conv
for param in model.features[1:].parameters():
    param.requires_grad = False

model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

for epoch in range(3):
    torch.cuda.synchronize()  # wait for pending GPU work so the timing is accurate
    t0 = time.time()
    for data, target in loader:
        optimizer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    t1 = time.time()
    print('Epoch {}, loss {}, time {}'.format(
        epoch, loss.item(), (t1 - t0)))

Hi ptrblck,

First of all thanks for the constant help you provide to all the users here in the forums.
You can see a comparison between the Titan V and the Tesla P100 here: https://www.videocardbenchmark.net/compare/TITAN-V-vs-Tesla-P100-PCIE-16GB/3859vs4039. Since yours is much faster, the computation time seems reasonable to me as well. I also have some other things going on in the background that might take ~1-2 seconds extra, like calculating some statistics and keeping track of everything on TensorBoard. Also, I noticed two more things:

  • I cannot achieve training or validation accuracy higher than 82%. I know I could use some data augmentation techniques (like flipping, rotating, etc., even though I didn’t; see the sketch after this list), but I doubt that alone would push it over 90%. Is that normal? After examining a bunch of images from a batch on TensorBoard, I noticed that all images lose some information after the resize (they look zoomed in), and this might be why I cannot achieve better scores.

  • On some runs, the validation scores are better than the training ones. It might be that the validation images are easier to classify than the training ones, but I don’t get why this happens. I split into training and validation using torch.utils.data.random_split(dataset, [50000, 10000]).
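
For reference, a rough sketch of the augmentation pipeline and split I mean (the exact transform choices and values are just placeholders, not what I currently run):

import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# example augmentations (flip/rotation); the rotation angle is a placeholder value
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor()])

dataset = datasets.FashionMNIST(
    root='./data', download=True, transform=train_transform)

# the split I'm currently using (the FashionMNIST train set has 60k images)
train_set, val_set = torch.utils.data.random_split(dataset, [50000, 10000])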

  1. transforms.Resize should only resize the images without any cropping or zooming. Are you using any other transformations, e.g. RandomResizedCrop?

  2. If you are using dropout, the training loss might be higher than the validation loss, as the model has a lower capacity during training than during evaluation. You could call model.eval() and run the complete training data through the model after each epoch to get a better approximation (see the sketch below). This is usually not necessary, but it’s a good way to check whether the loss gap is caused by dropout or by the running estimate of the training loss.
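
A rough sketch of what this re-evaluation could look like (model, loader, criterion, and device are assumed to be defined as in the script above):

@torch.no_grad()
def evaluate_loss(model, loader, criterion, device):
    # run the full dataset through the model in eval mode (dropout disabled)
    model.eval()
    total_loss = 0.
    total_samples = 0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        total_loss += criterion(output, target).item() * data.size(0)
        total_samples += data.size(0)
    model.train()  # switch back to training mode afterwards
    return total_loss / total_samples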

  1. I’m using RandomResizedCrop, so I guess that’s what causes the zooming? And if so, this might be what’s preventing my network from achieving a better loss, right?

  2. Oh yes, I totally forgot that model.eval() disables dropout. I’ll try your suggestion.

The scaling of RandomResizedCrop could be too aggressive for your model.
If I’m not mistaken, the default setup works well for Inception models, but you might need to reduce the scaling for your use case, e.g. by narrowing the scale range as in the snippet below.
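
RandomResizedCrop uses scale=(0.08, 1.0) by default; a narrower range could look like this (the exact numbers are just an illustration):

import torchvision.transforms as transforms

# crop between 50% and 100% of the image area instead of the default (0.08, 1.0)
transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ToTensor()])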


Thanks for the reply. I created a thread for my general approach; it would be nice if you could give some insight: Transfer learning using VGG-16 (or 19) for regression