Training Loss Increasing after each epoch

I am working on a simple challenge to predict shape images, but my training loss keeps increasing after each epoch. Why would this happen for such a use case?

# download a pre-trained network
model = models.densenet121(pretrained=True)

# freeze the features of the pre-trained network
for param in model.parameters():
    param.requires_grad = False

model.classifier = ShapesNetwork.ShapesNet(1024, 2, [256])
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.classifier.parameters(), lr=1e-4)
print(model)

ShapesNetwork.train(model, trainloader, testloader, criterion, optimizer, epochs=3)
The printed results are:
Epochs: 1/3 Training Loss: 0.874
Epochs: 2/3 Training Loss: 1.576
Epochs: 3/3 Training Loss: 2.260

My dataset has 100 images each of circles and squares.

It’s a bit hard to debug without seeing the code, but the loss might increase if, e.g., you are not zeroing out the gradients, are passing the wrong output to the criterion you are using, or are using a too-high learning rate.
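For reference, here is a minimal self-contained sketch (a hypothetical toy model with dummy data, not your actual setup) of the usual ordering of the calls for one training step:

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy setup, purely to illustrate the order of the calls.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 10)          # dummy batch of 8 samples
labels = torch.randint(0, 2, (8,))   # dummy class indices

optimizer.zero_grad()                # 1. clear gradients from the previous step
output = model(images)               # 2. forward pass
loss = criterion(output, labels)     # 3. compute the loss
loss.backward()                      # 4. backpropagate
optimizer.step()                     # 5. update the parameters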

Feel free to post more information about the training routine so that we can help further. 🙂


Thanks for your response. I am using the training routine below.

def train(model, trainloader, testloader, criterion, optimizer, epochs=3):
    steps = 0
    running_loss = 0

    for e in range(epochs):
        model.train()
        for images, labels in trainloader:
            steps += 1
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        else:
            print("Epochs: {}/{}".format(e+1, epochs),
                  "Training Loss: {:.3f}".format(running_loss/len(trainloader)))

The code looks alright, so I would recommend playing around with some hyperparameters, e.g. lowering the learning rate.
Your dataset is already small, but for the sake of debugging it might also be helpful to use just a few samples (e.g. 10) and make sure the loss decreases and your model is able to overfit this small data sample.
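As a sketch of that debugging idea (assuming the model, criterion, optimizer, and trainloader from the snippets above), you could wrap a handful of samples in a torch.utils.data.Subset and check that the loss can be driven close to zero on them:

from torch.utils.data import Subset, DataLoader

# Take only the first 10 samples of the existing dataset for an overfitting test.
small_dataset = Subset(trainloader.dataset, list(range(10)))
small_loader = DataLoader(small_dataset, batch_size=5, shuffle=True)

for epoch in range(50):
    running_loss = 0
    for images, labels in small_loader:
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print("Epoch {}: loss {:.3f}".format(epoch + 1, running_loss / len(small_loader)))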


Thank you, I will try.
Really appreciate your response.
I would like to chat with you if possible.

Not sure if I should write in this topic, but I have the same problem: the loss I get doesn’t seem to decrease.
I have played around with the learning rate and batch size, and at best I can get the loss to stay roughly constant. Only when I train on a very small dataset with a specific learning rate and batch size can I get it to decrease.
My dataset has ~200K samples; how many epochs and layers would I normally need for that?
Also, what loss should I be happy with? Is it relative to the initial loss, or should I aim for an absolute value close to zero?

import torch
import numpy as np
import torchvision
from torchvision import transforms, datasets
import torch.nn as nn
import torch.nn.functional as F
import pickle
import torch.optim as optim
from tqdm import tqdm
import time

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 128)    
        self.bn2 = nn.BatchNorm1d(128)
        self.fc3 = nn.Linear(128, 128) 
        self.bn3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, 2) 

    def forward(self,x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = F.relu(self.bn3(self.fc3(x)))
        x = self.fc4(x)
        return F.log_softmax(x,dim=1)
    
    
    
with open('x-u-data.pkl','rb') as f:  
    training_data = pickle.load(f)    
    
 
 
net = Net()    


optimizer = optim.Adam(net.parameters(),lr=10**(-3))
EPOCHS = 10_000

X = torch.zeros( (len(training_data), 5) )
Y = torch.zeros( (len(training_data), 2) )

for ii in range(len(training_data)):
    cur_data = training_data[ii]
    X[ii,:] = torch.FloatTensor(cur_data[0:5]).view(-1,5)
    Y[ii,:] = torch.FloatTensor(cur_data[5:7])

BATCH_SIZE = 200
for epoch in tqdm(range(EPOCHS)):
    for ii in tqdm(range(0, len(training_data), BATCH_SIZE)):    
        # batch of feature sets and labels
       
        batch_X = X[ii:ii+BATCH_SIZE].view(-1, 5)
        batch_Y = Y[ii:ii+BATCH_SIZE]        
        
        net.zero_grad()  # reset the gradients
        output = net(batch_X)
        
        loss = F.mse_loss(output,batch_Y)
        loss.backward()
        optimizer.step()
  
    print(loss)

F.mse_loss with log probabilities as the model output seems a bit unusual.
What kind of use case are you working on, and which values do the targets have?
F.log_softmax is often used with nn.NLLLoss for a multi-class classification use case.
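For comparison, a minimal illustration of that pairing (toy tensors, purely to show the expected shapes and target format):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                # batch of 4 samples, 3 classes
log_probs = F.log_softmax(logits, dim=1)  # log probabilities, all values <= 0
targets = torch.tensor([0, 2, 1, 2])      # class indices, not one-hot vectors
loss = nn.NLLLoss()(log_probs, targets)
print(loss)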

So I am trying to learn a control action for a robot. The input data is a 5-D vector that consists of a 2-D position, a 2-D velocity, and time (0–80 s, normalized to (0, 1)), and the output is a 2-D control action. Should I maybe use NLLLoss?
Edit: I do not have a classification problem, so maybe MSE with some other network output?

MSE might still work, but that would depend on the target range, I guess.
Note that log_softmax will return values in the range [-Inf, 0]. If your targets have another range, such as [0, 1], you should consider removing this activation or using another one.
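A quick check of that range, just for illustration:

import torch
import torch.nn.functional as F

x = torch.randn(2, 2) * 10
print(F.log_softmax(x, dim=1))  # every entry is <= 0; each row sums to 1 after exp()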

My output values are in [-10^3, 10^3] or something like that. Should I preprocess them so that they belong to [0, 1] or [-Inf, 0]? Or is it easier to adjust the network appropriately?

softmax and related activation functions are typically used for classification tasks, since they “push” the model outputs towards a decision for one class. Since you are predicting a (probably continuous) control vector, you are dealing with a regression task, so it would probably work best to use no non-linear activation at the model’s output and to preprocess the targets to a range like [-1, 1].
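As an illustrative sketch (not the code from above; the target scaling is an assumption, reusing the Y tensor built earlier), the regression variant could drop the final log_softmax and train against targets scaled to [-1, 1]:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw outputs, no activation

# Hypothetical target scaling: map control actions from roughly [-1e3, 1e3] to [-1, 1],
# keeping the scale factor so predictions can be mapped back at inference time.
scale = Y.abs().max()
Y_scaled = Y / scale

F.mse_loss on the raw outputs then compares values on the same scale as the scaled targets.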

Thanks, I normalized everything to [0, 1] and used a ReLU activation at the end; it seems to work better.

Good evening sir, I am having the same problem too. I would really appreciate your help.

Thanks for the advice! I had the same problem and found I didn’t zero out the gradients. This really saved my day.