My CNN Model is not Converging

I am a beginner in PyTorch; with the help of the YouTube channel “Deep Lizard” I learned about tensors and their operations. Now I am working on a vehicle classification model with 10 classes of 1000 images each, but the training loss is not converging. I have attached some chunks of the code below. Please let me know where I made a mistake. Thanks.

# Resize, convert, and normalize the input images
transform_in = transforms.Compose([
    transforms.Resize([128, 128]),
    transforms.ToTensor(),
    transforms.Normalize((0.4915, 0.4823, 0.4468),
                         (0.2470, 0.2435, 0.2616))
])

# Model class
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(2, 2)

        self.conv2 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(2, 2)

        self.conv3 = nn.Conv2d(32, 16, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(2, 2)

        self.conv4 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(2, 2)

        self.fc1 = nn.Linear(8 * 8 * 8, 288)
        self.fc2 = nn.Linear(288, 144)
        self.fc3 = nn.Linear(144, 72)
        self.fc4 = nn.Linear(72, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = F.max_pool2d(torch.relu(self.conv2(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv4(out)), 2)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.relu(self.fc1(out))
        out = torch.relu(self.fc2(out))
        out = torch.relu(self.fc3(out))
        out = torch.relu(self.fc4(out))
        out = F.softmax(out, dim=1)
        return out

# Training loop
def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device)
            labels = labels.to(device=device)
            outputs = model(imgs)
            loss = loss_fn(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            loss_train += loss.item()

        if epoch == 1 or epoch % 2 == 0:
            print('Epoch {}, Training loss {}'.format(
                epoch, loss_train / len(train_loader)))

# Training setup
model = Net().to(device=device)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

training_loop(
    n_epochs = num_epochs,
    optimizer = optimizer,
    model = model,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

From nn.CrossEntropyLoss docs:

The input is expected to contain raw, unnormalized scores for each class.

Do not include the softmax in your forward.
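For reference, a minimal sketch of the model with a forward that returns raw logits, keeping the layer shapes from the posted code (the unused pooling modules are dropped; this is just an illustration, not the only possible fix):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 16, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 288)
        self.fc2 = nn.Linear(288, 144)
        self.fc3 = nn.Linear(144, 72)
        self.fc4 = nn.Linear(72, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = F.max_pool2d(torch.relu(self.conv2(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv4(out)), 2)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.relu(self.fc1(out))
        out = torch.relu(self.fc2(out))
        out = torch.relu(self.fc3(out))
        # return raw logits: no relu and no softmax on the last layer,
        # since nn.CrossEntropyLoss applies log_softmax internally
        return self.fc4(out)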


Thanks for your kind response. I replaced the softmax with “self.fc4(out)”, and here is the output.

Epoch 1, Training loss 0.2394336342997849
Epoch 2, Training loss 0.41787283269688485
Epoch 4, Training loss 0.4808643462508917
Epoch 6, Training loss 0.4917321742884815
Epoch 8, Training loss 0.49367881812155245
Epoch 10, Training loss 0.4937179864756763
Epoch 12, Training loss 0.4934278305619955
Epoch 14, Training loss 0.4925368168763816
Epoch 16, Training loss 0.489936351031065
Epoch 18, Training loss 0.48225364279001953
Epoch 20, Training loss 0.41559089563786983
Epoch 22, Training loss 0.47623008105903863
Epoch 24, Training loss 0.47684788253158333
Epoch 26, Training loss 0.48494275493547323
Epoch 28, Training loss 0.46326695576310156
Epoch 30, Training loss 0.39736311135813596
Epoch 32, Training loss 0.14281618097797036
Epoch 34, Training loss 0.06763775760307908
Epoch 36, Training loss 0.03323963325470686
Epoch 38, Training loss 0.03329418990761042
Epoch 40, Training loss 0.04322414346039295
Epoch 42, Training loss 0.029511863365769386
Epoch 44, Training loss 0.022096360959112646
Epoch 46, Training loss 0.16103133413940668
Epoch 48, Training loss 0.38556999355554583
Epoch 50, Training loss 0.18612061651423573
Epoch 52, Training loss 0.026320594605058433
Epoch 54, Training loss 0.0228287323564291
Epoch 56, Training loss 0.032784194946289064
Epoch 58, Training loss 0.03217705018818379
Epoch 60, Training loss 0.10582428770139814
Epoch 62, Training loss 0.34886730622500184
Epoch 64, Training loss 0.3977765748649836
Epoch 66, Training loss 0.45065462989732624
Epoch 68, Training loss 0.22659578237682582
Epoch 70, Training loss 0.4815091462433338
Epoch 72, Training loss 0.4800715918652713
Epoch 74, Training loss 0.3046822458691895
Epoch 76, Training loss 0.13410430723801256
Epoch 78, Training loss 0.026426276881247757
Epoch 80, Training loss 0.018655495941638945

Hi @aatiibutt
It’s not good to see the training loss increase. You can add Dropout() and BatchNorm1d to your linear layers.
Make sure to set shuffle=True in the training DataLoader. Try these changes (a rough sketch is below).
It’s always good practice to have a validation set that is evaluated after each epoch.
Let me know if this worked.
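For illustration only, a sketch of those suggestions; the dropout probability, batch size, the 80/20 split, and the dataset / train_set / val_set names are assumptions, not values from the original post. The classifier head would replace fc1–fc4 inside the model.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

# fully connected head with BatchNorm1d and Dropout (p=0.5 is an assumed, untuned value)
classifier = nn.Sequential(
    nn.Linear(8 * 8 * 8, 288), nn.BatchNorm1d(288), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(288, 144),       nn.BatchNorm1d(144), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(144, 72),        nn.BatchNorm1d(72),  nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(72, 2),          # raw logits, no softmax
)

# assumed 80/20 train/validation split of the full dataset
n_val = int(0.2 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)   # shuffle the training data
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)

def validate(model, val_loader, loss_fn):
    model.eval()   # switch off dropout, use running batch-norm statistics
    loss_val, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs = imgs.to(device=device)
            labels = labels.to(device=device)
            outputs = model(imgs)
            loss_val += loss_fn(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    model.train()
    return loss_val / len(val_loader), correct / total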

In addition to @SANTOSH_S’ answer, the learning rate might also be too high, as the loss is quite shaky.

@aatiibutt If the problem still persists, try to use an LR scheduler
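A minimal sketch of both suggestions, a lower starting learning rate plus a scheduler; the 1e-4 learning rate, step_size=20, and gamma=0.5 are assumed, untuned values, and StepLR is just one of several schedulers that could be tried:

import torch.optim as optim

# a lower starting LR than 0.001 (1e-4 is an assumed value)
optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# halve the LR every 20 epochs (step_size and gamma are assumptions)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(1, num_epochs + 1):
    for imgs, labels in train_loader:
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        loss = loss_fn(model(imgs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()    # advance the schedule once per epoch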