Training Loss Increasing after each epoch

I am working on a simple challenge to predict shape images, but my training loss keeps increasing after each epoch. Why would this happen for such a use case?

# download a pre-trained network
model = models.densenet121(pretrained=True)

# freeze the features of the pre-trained network
for param in model.parameters():
    param.requires_grad = False

model.classifier = ShapesNetwork.ShapesNet(1024, 2, [256])
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.classifier.parameters(), lr=1e-4)
print(model)

ShapesNetwork.train(model, trainloader, testloader, criterion, optimizer, epochs=3)
The printed results are:
Epochs: 1/3 Training Loss: 0.874
Epochs: 2/3 Training Loss: 1.576
Epochs: 3/3 Training Loss: 2.260

My dataset has 100 images each of circles and squares.

It’s a bit hard to debug without seeing the code, but the loss might increase if, e.g., you are not zeroing out the gradients, are passing the wrong output to the criterion you are using, or are using a too-high learning rate.
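For reference, here is a minimal self-contained sketch (a hypothetical toy model with dummy data, not your actual setup) of the usual ordering of the calls for one training step:

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy setup, purely to illustrate the order of the calls.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 10)          # dummy batch of 8 samples
labels = torch.randint(0, 2, (8,))   # dummy class indices

optimizer.zero_grad()                # 1. clear gradients from the previous step
output = model(images)               # 2. forward pass
loss = criterion(output, labels)     # 3. compute the loss
loss.backward()                      # 4. backpropagate
optimizer.step()                     # 5. update the parameters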

Feel free to post more information about the training routine so that we can help further. 🙂


Thanks for your response. I am using the training routine below.

def train(model, trainloader, testloader, criterion, optimizer, epochs=3):
    steps = 0
    running_loss = 0

    for e in range(epochs):
        model.train()
        for images, labels in trainloader:
            steps += 1
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        else:
            print("Epochs: {}/{}".format(e+1, epochs),
                  "Training Loss: {:.3f}".format(running_loss/len(trainloader)))

The code looks alright, so I would recommend playing around with some hyperparameters, e.g. lowering the learning rate.
Your dataset is already small, but for the sake of debugging it might also be helpful to use just a few samples (e.g. 10) and make sure the loss decreases and your model is able to overfit this small data sample.
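As a sketch of that debugging idea (assuming the model, criterion, optimizer, and trainloader from the snippets above), you could wrap a handful of samples in a torch.utils.data.Subset and check that the loss can be driven close to zero on them:

from torch.utils.data import Subset, DataLoader

# Take only the first 10 samples of the existing dataset for an overfitting test.
small_dataset = Subset(trainloader.dataset, list(range(10)))
small_loader = DataLoader(small_dataset, batch_size=5, shuffle=True)

for epoch in range(50):
    running_loss = 0
    for images, labels in small_loader:
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print("Epoch {}: loss {:.3f}".format(epoch + 1, running_loss / len(small_loader)))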


Thank you, I will try.
Really appreciate your response.
I would like to chat with you if possible.

Not sure if I should write in this topic, but I have the same problem: the loss I get doesn’t seem to decrease.
I have played around with the learning rate and batch size, and at best I can get the loss to stay roughly constant. Only when I train on a very small dataset with a specific learning rate and batch size can I get it to decrease.
My dataset has ~200K samples; how many epochs and layers would I normally need for that?
Also, what loss should I be happy with? Is it relative to the initial loss, or should I aim for an absolute value close to zero?

import torch
import numpy as np
import torchvision
from torchvision import transforms, datasets
import torch.nn as nn
import torch.nn.functional as F
import pickle
import torch.optim as optim
from tqdm import tqdm
import time

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 128)    
        self.bn2 = nn.BatchNorm1d(128)
        self.fc3 = nn.Linear(128, 128) 
        self.bn3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, 2) 

    def forward(self,x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = F.relu(self.bn3(self.fc3(x)))
        x = self.fc4(x)
        return F.log_softmax(x,dim=1)
    
    
    
with open('x-u-data.pkl','rb') as f:  
    training_data = pickle.load(f)    
    
 
 
net = Net()    


optimizer = optim.Adam(net.parameters(),lr=10**(-3))
EPOCHS = 10_000

X = torch.zeros( (len(training_data), 5) )
Y = torch.zeros( (len(training_data), 2) )

for ii in range(len(training_data)):
    cur_data = training_data[ii]
    X[ii,:] = torch.FloatTensor(cur_data[0:5]).view(-1,5)
    Y[ii,:] = torch.FloatTensor(cur_data[5:7])

BATCH_SIZE = 200
for epoch in tqdm(range(EPOCHS)):
    for ii in tqdm(range(0, len(training_data), BATCH_SIZE)):    
        # batch of feature sets and labels
       
        batch_X = X[ii:ii+BATCH_SIZE].view(-1, 5)
        batch_Y = Y[ii:ii+BATCH_SIZE]        
        
        net.zero_grad()  # reset the gradients
        output = net(batch_X)
        
        loss = F.mse_loss(output,batch_Y)
        loss.backward()
        optimizer.step()
  
    print(loss)

F.mse_loss with log probabilities as the model output seems a bit unusual.
What kind of use case are you working on, and which values do the targets have?
F.log_softmax is often used with nn.NLLLoss for a multi-class classification use case.
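For comparison, a minimal illustration of that pairing (toy tensors, purely to show the expected shapes and target format):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                # batch of 4 samples, 3 classes
log_probs = F.log_softmax(logits, dim=1)  # log probabilities, all values <= 0
targets = torch.tensor([0, 2, 1, 2])      # class indices, not one-hot vectors
loss = nn.NLLLoss()(log_probs, targets)
print(loss)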

So I am trying to learn a control action for a robot. The input data is a 5-D vector that consists of a 2-D position, a 2-D velocity, and time (0–80 s, normalized to (0, 1)), and the output is a 2-D control action. Should I maybe use NLLLoss?
Edit: I do not have a classification problem, so maybe MSE with some other network output?

MSE might still work, but that would depend on the target range, I guess.
Note that log_softmax will return values in the range [-Inf, 0]. If your targets have another range, such as [0, 1], you should consider removing this activation or using another one.
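A quick check of that range, just for illustration:

import torch
import torch.nn.functional as F

x = torch.randn(2, 2) * 10
print(F.log_softmax(x, dim=1))  # every entry is <= 0; each row sums to 1 after exp()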

My output values are in [-10^3, 10^3] or something like that. Should I preprocess them so that they belong to [0, 1] or [-Inf, 0]? Or is it easier to adjust the network appropriately?

softmax and related activation functions are typically used for classification tasks, since they “push” the model outputs towards a decision for one class. Since you are predicting a (probably continuous) control vector, you are dealing with a regression task, so it would probably work best to use no non-linear activation at the model’s output and to preprocess the targets to a range like [-1, 1].
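As an illustrative sketch (not the code from above; the target scaling is an assumption, reusing the Y tensor built earlier), the regression variant could drop the final log_softmax and train against targets scaled to [-1, 1]:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw outputs, no activation

# Hypothetical target scaling: map control actions from roughly [-1e3, 1e3] to [-1, 1],
# keeping the scale factor so predictions can be mapped back at inference time.
scale = Y.abs().max()
Y_scaled = Y / scale

F.mse_loss on the raw outputs then compares values on the same scale as the scaled targets.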

Thanks, I normalized everything to [0, 1] and used a ReLU activation at the end; it seems to work better.

Good evening sir, I am having the same problem too. I would really appreciate your help.

Thanks for the advice! I had the same problem and found I didn’t zero out the gradients. This really saved my day.