Model is not learning. Accuracy and Loss stay the same for 100 epochs

Hello,
I have a very specific question about a model I am trying to train: something goes wrong and I don't know why.

I am training a network to recognize hand gestures, using the LeapGestRecog dataset from Kaggle. I have augmented the data by adding mirrored versions of the images.

I will start by showing you the dataset.

This is what my data looks like. Each picture is 120 by 320 grayscale, and I am using a batch size of 100 because otherwise I run out of memory. The training set contains 34,000 images, while the validation set contains only 3,000.

The network looks as follows:

class Model(nn.Module):
    def __init__(self,input_size=32, hidden_size=64,n_classes=10):
        """ Define our model """
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1,input_size, kernel_size=(3,3),stride=(1,1),padding=1)
        self.relu1 = nn.ReLU()
        self.maxp1 = nn.MaxPool2d(kernel_size=(2,2))
        self.conv2 = nn.Conv2d(input_size,hidden_size,kernel_size=(3,3),padding=1)
        self.relu2 = nn.ReLU()
        self.maxp2 = nn.MaxPool2d(kernel_size=(2,2))
        self.conv3 = nn.Conv2d(hidden_size,128,kernel_size=3,padding=1)
        self.maxp3 = nn.MaxPool2d(kernel_size=(2,2))
        self.l1    = nn.Linear(128 * 15 * 40,640)
        self.relul = nn.ReLU()
        self.l2    = nn.Linear(640,128)
        self.l3    = nn.Linear(128,n_classes)
        self.soft  = nn.Softmax(1)


    def forward(self, x):
        """ The forward pass of our model """
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxp1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxp2(x)
        x = self.conv3(x)
        x = self.maxp3(x)
        x = x.view(x.size(0),-1)
        x = self.l1(x)
        x = self.l2(x)
        x = self.l3(x)
        x = self.soft(x)
        return x
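As a sanity check, here is a quick dummy forward pass (just a sketch) confirming that the 128 * 15 * 40 flatten size matches 120 by 320 inputs:

import torch

dummy = torch.randn(1, 1, 120, 320)   # (batch, channels, height, width)
with torch.no_grad():
    out = Model()(dummy)
print(out.shape)                       # torch.Size([1, 10]), so the flatten size lines up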

The training function is a “standard procedure”

def train_model(model,train_data,valid_data,learning_rate,num_epochs,optimizer,criterion):
    """ Training procedure of the model together with accuracy and loss for both data sets """
    train_loss = np.zeros(num_epochs)
    valid_loss = np.zeros(num_epochs)
    train_accuracy = np.zeros(num_epochs)
    valid_accuracy = np.zeros(num_epochs)
    
    """begin training"""
    for epoch in range(num_epochs):
        model.train()
        train_losses = []
        train_correct= 0
        total_items  = 0

        valid_losses = []
        valid_correct = 0

        for images,labels in train_data:

            images = images.float()
            labels = labels.long()
            optimizer.zero_grad()

            """add to GPU hopefully"""
            images = images.to(device)
            labels = labels.to(device)

            """Forward pass"""
            outputs = model(images)
            loss    = criterion(outputs,labels)

            """Backward pass"""
            loss.backward()
            optimizer.step()

            """staticstics"""
            train_losses.append(loss.item())
            _, predicted = torch.max(outputs.data,1)
            train_correct += (predicted == labels).sum().item()
            total_items += labels.size(0)

        train_loss[epoch] = np.mean(train_losses)
        train_accuracy[epoch] = train_correct / total_items

        with torch.no_grad():
            correct_val = 0
            total_val = 0

            for images,labels in valid_data:

                images = images.float()
                labels = labels.long()

                images = images.to(device)
                labels = labels.to(device)

                outputs = model(images)
                loss    = criterion(outputs, labels)

                valid_losses.append(loss.item())
                _, predicted = torch.max(outputs.data, 1)

                correct_val += (predicted == labels).sum().item()
                total_val   += labels.size(0)

        valid_loss[epoch] = np.mean(valid_losses)
        valid_accuracy[epoch] = correct_val / total_val

        print("Epoch: [{},{}], train accuracy: {:.4f}, valid accuracy: {:.4f}, train loss: {:.4f}, valid loss: {:.4f}"
        .format(num_epochs,epoch+1,train_accuracy[epoch],valid_accuracy[epoch],train_loss[epoch],valid_accuracy[epoch]))

    return model, train_accuracy, train_loss, valid_accuracy, valid_loss

This is how I call the train function:

network = Model()
network = network.to(device)

optimizer = torch.optim.SGD(network.parameters(),lr=0.01,momentum=0.9)
criterion = nn.CrossEntropyLoss()

model, train_accuracy, train_loss, valid_accuracy, valid_loss = train_model(
    model=network, train_data=train, valid_data=valid,
    learning_rate=0.001, num_epochs=100,
    optimizer=optimizer, criterion=criterion)

print("Ready")

I tried tweaking the model parameters such as the learning rate, but no matter what I do, the accuracy and loss stay the same. Please help me, I don't know what is going wrong.

You use CrossEntropyLoss, which according to the documentation:

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

And your network outputs a softmax, so that's a Softmax followed by a LogSoftmax. I suggest either using NLLLoss or removing your last softmax layer.

Should you choose to remove the softmax from your model, make sure to adjust the accuracy metrics to accept logits instead of probabilities.
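Something like this, roughly (just a sketch with placeholder names; logits is whatever your last Linear layer returns):

import torch.nn as nn
import torch.nn.functional as F

# Option 1: drop the softmax and keep CrossEntropyLoss (it applies LogSoftmax + NLLLoss internally)
loss = nn.CrossEntropyLoss()(logits, labels)

# Option 2: keep a log-softmax output and use NLLLoss instead
log_probs = F.log_softmax(logits, dim=1)
loss = nn.NLLLoss()(log_probs, labels)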

Roy

With nn.CrossEntropyLoss you don't need this softmax layer. The logit outputs (from the l3 layer) can simply be passed as the input to the criterion.
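In other words, the end of the forward pass would look roughly like this (a sketch using the layer names from your model):

        x = self.l1(x)
        x = self.l2(x)
        x = self.l3(x)   # raw logits, no softmax
        return x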

Unfortunately that was not the problem. Do you know what else could be causing it?

Thanks, I chose to remove it, but now I am getting NaNs, which is probably because I did not readjust my metrics. Could you explain what you meant? Also, I was advised to add another ReLU after the third convolution.

It's good that you removed the softmax, so now you have only one softmax (the one inside CrossEntropyLoss), as expected.

Adding another ReLU might help, but it's not a game changer, because nonlinearity already exists in your model.
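If you do add it, it would go right after conv3, something like this (a sketch based on the layer names in your model):

# in __init__
self.relu3 = nn.ReLU()

# in forward, between conv3 and maxp3
x = self.conv3(x)
x = self.relu3(x)
x = self.maxp3(x)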

Do the NaNs come from your loss, or from your accuracy metrics?
Since it's a classification challenge, I would say that categorical accuracy is what you need.
I've taken this implementation from lpd:

import torch as T

def categorical_correct(y_pred, y_true):
    """Count how many predictions match the true labels; y_pred holds the raw class scores."""
    indices = T.max(y_pred, 1)[1]            # argmax over the class dimension
    correct = T.eq(indices, y_true).view(-1)
    return correct.float().sum()

So in your example, you should do:
train_correct += categorical_correct(outputs, labels).item()
and for the validation set:
correct_val += categorical_correct(outputs, labels).item()
(note that the function takes the raw outputs rather than the argmaxed predictions, since it does the argmax itself).

Roy

I did that as well, but the NaNs that I get are in the loss. I think that something is going wrong with the data… Otherwise I don't know what could possibly be going wrong. Do you have any other suggestions?

Can't really tell unless you provide more details about the data, and ideally some example data points (labels included).

Are you normalizing the input pixels?
Meaning, does each channel range from 0-255 or from 0-1? Consider dividing by 255 to map to [0, 1] if you haven't done that.
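For example, something like this (just a sketch; adapt it to wherever you build your tensors):

# if the images arrive as uint8 in [0, 255], scale them before feeding the model
images = images.float() / 255.0

# or, if you load with torchvision, ToTensor() already maps [0, 255] to [0, 1]:
# transform = torchvision.transforms.Compose([
#     torchvision.transforms.Grayscale(num_output_channels=1),
#     torchvision.transforms.ToTensor(),
# ])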

Roy

Try updating your learning rate. I usually divide the learning rate by 10 when two consecutive losses are within 0.01 of each other. You can use a learning rate scheduler from torch.optim (ReduceLROnPlateau might be the one), or you can make a simple scheduler yourself:

# Outside the epoch loop
previous_loss = 0
l_rate = learning_rate  # l_rate keeps track of the current learning rate

# Inside the epoch loop, at the end
if abs(previous_loss - loss.item()) <= 0.01:
    l_rate /= 10
    optimizer = torch.optim.SGD(model.parameters(), lr=l_rate, momentum=0.9)
previous_loss = loss.item()
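Or, with the built-in scheduler (a minimal sketch; the factor and patience values are just examples):

from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)

for epoch in range(num_epochs):
    # ... training and validation as before ...
    scheduler.step(valid_loss[epoch])  # reduce the lr when the validation loss plateaus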