Why doesn't my loss go down past 0.69?

I am new to PyTorch, so I decided to start with the cats vs dogs dataset. The problem I don’t understand is that no matter how I change my model, my loss doesn’t decrease below 0.69. The loss reaches this value after the first epoch and then barely changes no matter how many epochs I run (after 10 epochs it reached 0.687). I have tried different layer sizes, different numbers of input and output channels, different loss functions (BCELoss, CrossEntropyLoss), and different optimizers (Adam, SGD). Is it possible that my training data is badly constructed, or is it something else?

CODE:

import os
import cv2
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch import nn
import torch.nn.functional as F
from torch.optim import Adam, SGD
from pkbar import Kbar
import pickle


class DogCatClassifier(nn.Module):

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 5)
        self.conv2 = nn.Conv2d(64, 128, 5)
        self.conv3 = nn.Conv2d(128, 256, 5)
        self.conv4 = nn.Conv2d(256, 512, 3)

        self.fc1 = nn.Linear(512, 64)
        self.fc2 = nn.Linear(64, 32)

        self.fc_out = nn.Linear(32, 1) # For BCELoss
        # self.fc_out = nn.Linear(32, 2)

    def forward(self, tensor):
        tensor = F.max_pool2d(F.relu(self.conv1(tensor)), (2, 2))
        tensor = F.max_pool2d(F.relu(self.conv2(tensor)), (2, 2))
        tensor = F.max_pool2d(F.relu(self.conv3(tensor)), (2, 2))
        tensor = F.max_pool2d(F.relu(self.conv4(tensor)), (2, 2))

        tensor = torch.flatten(tensor, start_dim=1)

        tensor = F.relu(self.fc1(tensor))
        tensor = F.relu(self.fc2(tensor))

        tensor = self.fc_out(tensor)
    
        # return tensor # For cross-entropy loss
        return torch.sigmoid(tensor)


def create_train_data(path, img_size):
    labels = []
    images = []

    for file in os.listdir(path):
        # Label from the file name: 0 = cat, 1 = dog
        if 'cat' in file:
            labels.append(0)
        else:
            labels.append(1)

        # Load every image inside the loop so images and labels stay aligned
        image = cv2.imread(path + file, cv2.IMREAD_GRAYSCALE)
        image = cv2.resize(image, (img_size, img_size))
        image = np.array(image)

        images.append(image)

    images = np.array(images)

    x_train = torch.tensor(images, dtype=torch.float32)
    y_train = torch.tensor(labels, dtype=torch.long)

    x_train = x_train / 255.0
    x_train = x_train.unsqueeze(1)  # Add the channel dimension for grayscale images

    return x_train, y_train


EPOCHS = 5
BATCH_SIZE = 100
IMG_SIZE = 64

# x_train, y_train = create_train_data('cats_vs_dogs/train/train/', IMG_SIZE)
# with open('x_train.pickle', 'wb') as file:
#     pickle.dump(x_train, file)
#
# with open('y_train.pickle', 'wb') as file:
#     pickle.dump(y_train, file)

with open('x_train.pickle', 'rb') as file:
    x_train = pickle.load(file)

with open('y_train.pickle', 'rb') as file:
    y_train = pickle.load(file)

y_train = torch.tensor(y_train, dtype=torch.float)

train_dataset = TensorDataset(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

model = DogCatClassifier().cuda()

optimizer = Adam(model.parameters(), lr=0.01)
# optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9) # Tried momentum: 0.1-0.9

# loss_function = nn.CrossEntropyLoss()
# loss_function = nn.MSELoss()
loss_function = nn.BCELoss()

for epoch in range(EPOCHS):
    print()
    print(f"Epoch: {epoch + 1}/{EPOCHS}")
    kbar = Kbar(target=len(x_train) / BATCH_SIZE, width=32)
    i = 0

    for images, labels in train_loader:
        images = images.cuda()
        labels = labels.cuda()

        preds = model(images)
    
        loss = loss_function(preds, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        kbar.update(i, values=[("loss", loss)])
        i += 1

The loss calculation for nn.BCELoss looks wrong, as this criterion expects the model outputs to be probabilities provided via a sigmoid activation, while you are applying torch.max on it.
Besides that the code looks alright and I cannot find anything obviously wrong.
I think you’ll find some posts here in the forum which are discussing this particular dataset, which might contain some code as a starter.
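For context on the number itself: 0.693 is ln 2, the binary cross-entropy of a prediction of exactly 0.5. A loss pinned at that value usually means the network gives roughly the same 0.5 probability to every image, whatever its label. A minimal check in plain Python, where the helper name `bce` is made up for illustration and is not a PyTorch function:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A model that always answers 0.5 pays ln(2) on every sample, whatever the label.
loss_cat = bce(0.5, 0)
loss_dog = bce(0.5, 1)
print(loss_cat, loss_dog, math.log(2))

# A model that is confidently right pays far less.
confident_loss = bce(0.9, 1)
print(confident_loss)
```

So a curve that flattens at 0.69 is best read as "the model has not learned to separate the classes at all", rather than as a slow but working training run.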

"The loss calculation for nn.BCELoss looks wrong, as this criterion expects the model outputs to be probabilities provided via a sigmoid activation, while you are applying torch.max on it." Yes, this was my mistake: I copied the code as it was at the time without checking that it worked. I have updated the code so the BCE part looks right now, but the loss still doesn’t go down past 0.693.

If I use nn.BCEWithLogitsLoss() instead, what is the proper input for it? Do I have to scale the output of the network to the range 0 to 1 (e.g., with a softmax function)? Or do I need some other activation function?

Also, what about nn.CrossEntropyLoss()? What is the proper input for it?

Both loss functions (nn.BCEWithLogitsLoss and nn.CrossEntropyLoss) expect logits as the model output, so no activation should be used. Internally, these loss functions apply the corresponding activation functions.
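To make "logits in, activation applied internally" concrete, here is a small plain-Python sketch. The two functions are simplified single-sample reimplementations written for illustration only (their names are hypothetical and this is not PyTorch's code): one applies a sigmoid and then binary cross-entropy, the other applies a log-softmax and then the negative log-likelihood.

```python
import math

def bce_with_logits(z, y):
    """Sigmoid followed by binary cross-entropy, for one raw logit z and label y."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(logits, target):
    """Log-softmax followed by negative log-likelihood, for one row of raw logits."""
    log_sum = math.log(sum(math.exp(v) for v in logits))
    return -(logits[target] - log_sum)

z = 1.7  # raw output of a single-unit final layer, no activation applied
one_output = bce_with_logits(z, 1)

# The same decision expressed with two output units: logits [0, z] for (cat, dog).
two_outputs = cross_entropy([0.0, z], 1)

print(one_output, two_outputs)
```

Both calls print the same value, which is why a one-unit head with a sigmoid-based loss and a two-unit head with a softmax-based loss are interchangeable for binary classification, as long as the model itself returns raw logits.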

Thank you very much. I have got it working.

At the moment, my loss (nn.BCEWithLogitsLoss) is stuck at 0.3132 …

  • Is the lowest possible value of nn.BCEWithLogitsLoss a zero value?

  • Would you comment on what could have been going wrong with the fact that the loss will not decrease from 0.3132?

  • At the moment, I use the BCEWithLogitsLoss() to train with the positive labels only. Is it okay to use such loss function in this way? For example, at the moment, all of my targets also have value “1.0” (float). I have converted them from boolean value, i.e., if the target value is “True”, then it is converted to 1.0, otherwise 0.0.

I guess the loss value is 0.313 because every value in my model output is “1.0” (the output represents a correlation matrix). torch.sigmoid(1.0) is about 0.73, and log(0.73) is about -0.313, which gives a loss of 0.3132…

  • My model output takes the value 1 because it represents a correlation matrix: I normalize the output features from my neural network and then multiply the vectors to get the correlation matrix. However, this seems to limit how far BCEWithLogitsLoss can progress.

  • However, if I do not normalize the output features, the correlation matrix has very large values (1e19), and BCEWithLogitsLoss then takes a very large value such as 7 x 10^12…
    Could you recommend a good way to handle this situation?
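The arithmetic behind the 0.3132 figure can be checked in a few lines of plain Python. This is only a sketch: the helper names are made up, and the assumption that the output is confined to [-1, 1] comes from the description of a normalized correlation matrix above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for_positive(z):
    """BCE-with-logits for a target of 1.0 and a raw output z."""
    return -math.log(sigmoid(z))

print(sigmoid(1.0))            # probability assigned when the output is 1.0
print(loss_for_positive(1.0))  # loss when every output is 1.0 and every target is 1.0

# If the output is a normalized correlation confined to [-1, 1],
# 1.0 is already the best case, so the loss cannot drop below this value.
floor = min(loss_for_positive(z / 100) for z in range(-100, 101))
print(floor)
```

The probability comes out at about 0.731 and the loss at about 0.3133, which matches the plateau reported above: with outputs capped at 1.0 the loss has reached its floor.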

The minimum achievable loss value depends on your setup, but in the default setup a theoretical loss of zero could be reached. In practical use cases you would come close to zero, but would most likely never reach it exactly due to, e.g., the limited precision of floating-point values (i.e. the gradient updates might never move the parameters to the “perfect” parameter set).

If you only train with positive labels, your model would most likely learn to only predict these positive labels, which would explain why the outputs are always ones.

I’m not familiar with your use case so you would need to check, if nn.BCEWithLogitsLoss is the right loss function for it. It’s usually used for a binary or multi-label classification.


Thank you very much. I see. It seems this depends on my application, so I will try other loss functions.

Hi, This is my code for the CIFAR-10 dataset:

Just sharing the code from the trainloader step. If you are still stuck I can share the entire notebook.

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
valloader = torch.utils.data.DataLoader(valset, batch_size=64, shuffle=False, num_workers=2)

dataiterator = iter(trainloader)
samples = next(dataiterator)

# making the architecture
class Architecture(nn.Module):
    
    def __init__(self, input_num_channels, num_classes):
        super(Architecture, self).__init__()
        
        self.conv_blocks = nn.Sequential(

            nn.Conv2d(in_channels=input_num_channels, out_channels=32, kernel_size=(3,3), padding=(1,1), stride=(2,2), dilation=(2,2), bias=True),
            nn.BatchNorm2d(num_features=32),
            nn.LayerNorm([32,15,15], elementwise_affine=True),
            nn.InstanceNorm2d(num_features=32, affine=True),
            nn.GroupNorm(num_groups=4, num_channels=32, affine=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2,2), padding=(0,0), stride=(1,1), dilation=(2,2))
            
        )
        
        self.linear_blocks = nn.Sequential(

            nn.Linear(in_features=13*13*32, out_features=2000, bias=True),
            nn.BatchNorm1d(num_features=2000),
            nn.LayerNorm([2000], elementwise_affine=True),
            nn.GroupNorm(num_groups=100, num_channels=2000, affine=True),
            nn.Tanh(),
            nn.Linear(in_features=2000, out_features=num_classes, bias=True)
        )
        
    def forward(self, x):
        
        x = self.conv_blocks(x)
        x = x.view(x.size(0), -1)
        x = self.linear_blocks(x)
        
        return x
    


def weight_init(m):
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_normal_(m.weight)
#         nn.init.kaiming_normal_(m.bias)
        
    elif isinstance(m, (nn.Linear, nn.BatchNorm1d)):
        nn.init.normal_(m.weight)
        nn.init.normal_(m.bias)
        
        
net = Architecture(3, 10)
net.apply(weight_init)


criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

num_epochs=10

for epoch in range(num_epochs):
    
    print("Epoch:", epoch+1)
    
    running_loss = 0.0
    
    
    print(optimizer.param_groups[0]['lr'])
    
    
    for index, data in enumerate(trainloader):

        inputs, labels = data
        
        optimizer.zero_grad()
        
        outputs = net(inputs)
#         print(outputs.shape)
#         print(labels.shape)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        print(loss.item())
        
        running_loss += loss.item()
    
    scheduler.step()
    
    print("training loss after epoch", epoch+1, ":", running_loss/len(trainloader))
    
    train_accuracy = calculate_accuracy(trainloader)
    print("train accuracy after epoch", epoch+1, ":", train_accuracy)
    
    val_accuracy = calculate_accuracy(valloader)
    print("val accuracy after epoch", epoch+1, ":", val_accuracy)

This was probably my first code in PyTorch. I have learned a lot since then, for example that it is best to define separate functions for training and testing and then call them inside the epoch loop.
Like:

def train(model: nn.Module,
          iterator: torch.utils.data.DataLoader,
          optimizer: optim.Optimizer,
          criterion: nn.Module, 
          clip_value):

    model.train()

    epoch_loss = 0

    for _, (imgs, texts, text_seq_lens_list, targets) in enumerate(iterator):

        optimizer.zero_grad()

        output = model(imgs, texts, text_seq_lens_list)

        loss = criterion(output, targets)

        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), clip_value)

        optimizer.step()

        epoch_loss += loss.item()
        print(loss.item())

    return epoch_loss / len(iterator)

num_epochs = 10
clip_value = 1.0

for epoch in range(num_epochs):

    print(epoch)
    train_loss = train(model, trainloader, optimizer, criterion, clip_value)
    print("train loss:", train_loss)

I didn’t write this piece; I found it somewhere on GitHub, I think in Ben Trevett’s repo.

Hey, to anyone out there still having this problem:
For me the relevant clue was turning down the learning rate. I work with Adam as optimizer and my learning rate of 0.01 seems to have caused this error of getting stuck at a loss of 0.69.
With a learning rate of 0.0001 everything works much better now.

This is an old thread, so my potential solution is not for the OP but a general observation for others who may come across this: if you use a scheduler, some schedulers (for example, CyclicLR) must be updated with scheduler.step() inside the per-batch loop, not once per epoch.
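To see why the placement matters, here is a rough plain-Python sketch of a triangular cyclic schedule. It is an illustration of the policy, not PyTorch's implementation, and the function name, learning rates, step size, and batches per epoch are all assumed values.

```python
import math

def triangular_lr(step, base_lr, max_lr, step_size):
    """Learning rate of a triangular cyclic schedule after `step` scheduler steps."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

BASE_LR, MAX_LR = 1e-4, 1e-2
STEP_SIZE = 2000          # scheduler steps needed to climb from base to max (assumed)
BATCHES_PER_EPOCH = 250   # assumed
EPOCHS = 10

# Stepping once per batch: after 10 epochs the schedule has seen 2500 steps.
per_batch = triangular_lr(EPOCHS * BATCHES_PER_EPOCH, BASE_LR, MAX_LR, STEP_SIZE)

# Stepping once per epoch: after 10 epochs the schedule has seen only 10 steps.
per_epoch = triangular_lr(EPOCHS, BASE_LR, MAX_LR, STEP_SIZE)

print(per_batch, per_epoch)
```

With these numbers the per-epoch variant is still sitting almost exactly at the base learning rate after ten epochs, so the cycle never really starts and training can look stuck.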

Not zeroing gradients inside the nested training loop with optimizer.zero_grad() could be the issue.

Other possible issues in binary classification:
For most cases, BCEWithLogitsLoss is preferable to BCELoss because it uses the log-sum-exp trick for better numerical stability around small numbers (it avoids evaluating the undefined log(0)). This also means more stable updates and gradient flow.
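The stability point can be illustrated in plain Python. This is a sketch of the usual algebraic rearrangement, with hypothetical function names; PyTorch's internals may differ in detail, and nn.BCELoss clamps its log outputs instead of raising the way math.log does here.

```python
import math

def naive_bce(z, y):
    """Sigmoid first, then log: breaks once the probability rounds to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def stable_bce_with_logits(z, y):
    """Algebraically the same loss, computed from the logit without forming p."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both versions agree.
naive_small = naive_bce(2.0, 0)
stable_small = stable_bce_with_logits(2.0, 0)
print(naive_small, stable_small)

# Confidently wrong prediction: sigmoid(40) rounds to exactly 1.0 in float64,
# so the naive version ends up evaluating log(0).
try:
    naive_result = naive_bce(40.0, 0)
except ValueError:
    naive_result = None  # math.log(0.0) raises ValueError

stable_large = stable_bce_with_logits(40.0, 0)
print(naive_result, stable_large)
```

The stable form still returns a finite, meaningful loss (about 40) for the confidently wrong prediction, which is exactly the situation where a usable gradient matters most.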

Not toggling model.train() vs model.eval() is another common source of issues.

A learning rate that is too low or too high, if you are not using a scheduler.

Finally, if you find that you keep plateauing at a loss of about 0.65 to 0.72 and seem “stuck” at 0.68 or 0.69: this is consistent with mistakenly applying the sigmoid function twice, for example defining it in your stack (because you’re using BCELoss) and then applying it again when making a prediction.

Applying sigmoid to an already-sigmoided value of roughly 0.6 to 0.95 gives about 0.65 to 0.72, which falls right in that range near 0.69 and looks like white noise around that value. In the cases where people have said they plateaued at 0.69, a mistaken double sigmoid has most often been the cause.

Relatedly, BCEWithLogitsLoss() requires there NOT be a final sigmoid() activation in the feed forward stack.
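The squeezing effect of a doubled sigmoid is easy to see with a few numbers. This is a plain-Python sketch with made-up helper names and arbitrary example logits:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for one probability p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [-8.0, -2.0, 0.0, 2.0, 8.0]

single = [sigmoid(z) for z in logits]           # spans almost all of (0, 1)
double = [sigmoid(sigmoid(z)) for z in logits]  # squeezed into (0.5, 0.731)
print(single)
print(double)

# Even a strongly negative logit cannot push the doubled output below 0.5,
# so a correctly classified negative sample still costs at least ln(2).
best_negative_loss = bce(sigmoid(sigmoid(-8.0)), 0)
print(best_negative_loss, math.log(2))
```

Because the first sigmoid already maps everything into (0, 1), the second one can only produce values between 0.5 and about 0.731, so the model can never express a confident "negative" no matter how it is trained.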
