Hi,
I’m using Google Colab with an NVIDIA Tesla P100 (16GB of GPU memory). I used VGG-16 without batch norm. I froze all layers except the first one, which I use to go from 1 to 3 channels, and the classifier layers. Here is a snippet from my code:
assert self.image_size == 224, "ERROR: Wrong image size."
model = torchvision.models.vgg16(pretrained=True) if self.model_type == 'vgg-16' else torchvision.models.vgg19(pretrained=True)
if self.input_ch != 3:
    # prepend a trainable conv layer mapping input_ch channels to the 3 channels VGG expects
    first_conv_layer = [nn.Conv2d(self.input_ch, 3, kernel_size=3, stride=1, padding=1, dilation=1, groups=1, bias=True)]
    first_conv_layer.extend(list(model.features))
    model.features = nn.Sequential(*first_conv_layer)
# replace the final classifier layer and append a small head for output_ch classes
model.classifier[-1] = nn.Linear(4096, 1000)
model.classifier.add_module('7', nn.ReLU())
model.classifier.add_module('8', nn.Dropout(p=0.5, inplace=False))
model.classifier.add_module('9', nn.Linear(1000, self.output_ch))
model.classifier.add_module('10', nn.LogSoftmax(dim=1))
for param in model.features[1:].parameters():  # disable grad for the pretrained layers
    param.requires_grad = False
I trained it on the Fashion-MNIST dataset, which has num_channels=1; that’s why I use an extra layer at the beginning. I added an nn.LogSoftmax(dim=1) as the last layer because I’m using nn.NLLLoss().
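As a quick sanity check on that pairing (a minimal sketch, not from my training code): nn.LogSoftmax followed by nn.NLLLoss computes the same loss as nn.CrossEntropyLoss applied to the raw logits:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)           # batch of 4, 10 classes
target = torch.randint(0, 10, (4,))   # class indices

# LogSoftmax + NLLLoss, as in the model above
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_a = nn.NLLLoss()(log_probs, target)

# CrossEntropyLoss does both steps internally on raw logits
loss_b = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(loss_a, loss_b))  # True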
My question is: with 50k training images and 10k validation images, learning_rate=0.0001, batch_size=64, and the Adam optimizer, it took about 3.5 hours for 20 epochs. Is that normal?
~10 minutes per epoch might be reasonable for this model on your setup.
I tried it on my machine with a Titan V and got ~4 minutes per epoch:
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision.models as models
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# setup
dataset = datasets.FashionMNIST(
    root='/home/pbialecki/python/data',
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor()]))
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4)

device = 'cuda'
model = models.vgg16(pretrained=True)
model.features[0] = nn.Conv2d(1, 64, 3, 1, 1)  # accept 1-channel input
model.classifier[-1] = nn.Linear(4096, 1000)
model.classifier.add_module('7', nn.ReLU())
model.classifier.add_module('8', nn.Dropout(p=0.5, inplace=False))
model.classifier.add_module('9', nn.Linear(1000, 10))
model.classifier.add_module('10', nn.LogSoftmax(dim=1))
for param in model.features[1:].parameters():  # freeze the pretrained feature layers
    param.requires_grad = False
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

for epoch in range(3):
    torch.cuda.synchronize()  # flush pending GPU work so the timing is accurate
    t0 = time.time()
    for data, target in loader:
        optimizer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    t1 = time.time()
    print('Epoch {}, loss {}, time {}'.format(
        epoch, loss.item(), (t1 - t0)))
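As a side note (a small sketch, not part of the timing above): since only the first conv layer and the classifier require gradients, you could pass just the trainable parameters to the optimizer, so Adam doesn’t keep optimizer state for the frozen weights:

# sketch: optimize only the parameters that still require gradients
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-3)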
Hi ptrblck,
First of all, thanks for the constant help you provide to all the users here on the forums.
You can see a comparison between the Titan V and the Tesla P100 here: https://www.videocardbenchmark.net/compare/TITAN-V-vs-Tesla-P100-PCIE-16GB/3859vs4039. Yours is considerably faster, so the computation time seems reasonable to me as well. I also have some other things going on in the background that might add ~1-2 seconds, like calculating some statistics, tracking everything in TensorBoard, etc. Also, I noticed 2 more things:
- I cannot achieve training or validation accuracy higher than 82%. I know that I could use some data augmentation techniques (flipping, rotating, etc., even though I didn’t; see the sketch after this list), but I doubt that would get it over 90%. Is that normal? After examining a bunch of images from a batch on TensorBoard, I noticed that all images lose some information after resizing (they look extra zoomed-in), and this might be why I cannot achieve better scores.
- On some runs, the validation scores are better than the training ones. It might be that the validation images are easier to classify than the training ones, but what I don’t get is why this happens. I split into training and validation using torch.utils.data.random_split(dataset, [50000, 10000]).
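For context, the kind of augmentation I mean would look something like this (just a sketch; I haven’t actually tried these exact transforms or parameters):

import torchvision.transforms as transforms

# sketch: light augmentation for Fashion-MNIST before the usual resize
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor()])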
The scaling of RandomResizedCrop could be too aggressive for your model. If I’m not mistaken, it worked well with the default setup in Inception models, but you might need to reduce the scaling for your use case.
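Something like this might help (just a sketch; the exact scale range is a guess you would need to tune). By default, RandomResizedCrop samples the crop area from scale=(0.08, 1.0) of the image, so narrowing that range makes the crops much less aggressive:

import torchvision.transforms as transforms

# default: crop area sampled from 8% to 100% of the image (can zoom in a lot)
aggressive = transforms.RandomResizedCrop(224)

# sketch: keep at least 80% of the image area so less information is cropped away
gentler = transforms.RandomResizedCrop(224, scale=(0.8, 1.0))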
Thanks for the reply. I created a thread for my general approach; it would be nice if you could give some insight: Transfer learning using VGG-16 (or 19) for regression