Cross Entropy loss is not decreasing

Hey everyone,

I have the following code set up.

sigm=torch.nn.Sigmoid()
#z1 is an instance of a class that does batch normalization, ReLU and max pooling

def reparam(a,b,h):
    weight_m=(2*sigm(b)-(2*sigm(a)*sigm(b))-1+sigm(a))
    weight_v=(1-sigm(a))-weight_m**2
    om=F.conv2d(h,weight_m,padding=1)
    ov=F.conv2d(h**2,weight_v,padding=1)
    e=torch.randn(ov.shape).cuda()
    z=om+(ov*e)
    return z1(z)

loss=torch.nn.CrossEntropyLoss()
optimizer=torch.optim.Adam([a1,b1,a2,b2,…,a6,b6],lr=learning_rate,weight_decay=weight_decay)

for epoch in range(num_epochs):
    for images,labels in train_loader:

        #forward propagation
        op1=reparam(a1,b1,images)
        op2=reparam(a2,b2,op1)
        …
        op6=reparam(a6,b6,op5)
        …# code for fully connected layers

        y2=F.softmax(op6,dim=1)
        lossi=loss(y2,labels)
        optimizer.zero_grad()
        lossi.backward()
        optimizer.step()

Hence, the parameters to train are a1, b1, …, a6, b6.
In the above piece of code, when I print my loss it does not decrease at all. It always stays roughly the
same, around 2.30:
epoch 0 loss = 2.308579206466675
epoch 1 loss = 2.297269344329834
epoch 2 loss = 2.3083386421203613
epoch 3 loss = 2.3027005195617676
epoch 4 loss = 2.304455518722534
epoch 5 loss = 2.305694341659546
epoch 6 loss = 2.3002185821533203
epoch 7 loss = 2.304798126220703
epoch 8 loss = 2.3063807487487793
epoch 9 loss = 2.3052620887756348
epoch 10 loss = 2.2963013648986816
epoch 11 loss = 2.3032405376434326
epoch 12 loss = 2.293735980987549
epoch 13 loss = 2.30415940284729
epoch 14 loss = 2.3025383949279785
epoch 15 loss = 2.307767868041992
epoch 16 loss = 2.300485610961914
epoch 17 loss = 2.304170846939087
epoch 18 loss = 2.302550792694092
epoch 19 loss = 2.3051881790161133
epoch 20 loss = 2.3025758266448975

I am not sure where the problem is. I am trying hard to rectify it but not sure how to go about it. Any help would be highly appreciated.

I would like to know if there is something I am missing that causes the back-propagation to fail.

nn.CrossEntropyLoss expects raw logits as the model output, so you should remove the softmax and just pass op6 directly to your loss function.
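For example, something along these lines (a minimal sketch with a made-up batch size and class count):

import torch

criterion = torch.nn.CrossEntropyLoss()

logits = torch.randn(8, 10, requires_grad=True)  # raw model outputs, no softmax applied
labels = torch.randint(0, 10, (8,))              # integer class indices, not one-hot

loss = criterion(logits, labels)                 # log_softmax is applied internally
loss.backward()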
Let me know if that helped.


@ptrblck, I did try removing the softmax as well. In this case the error decreases but the accuracy does not improve.

Does the loss get stuck at some point?
Could you play around with the learning rate (lowering) and other hyper-parameters as well?

@ptrblck, the loss does get stuck at 2.3025 and oscillates between 2.29 and 2.31.
I tried changing the learning rate and other hyperparameters. The strange part is that the code works well on MNIST but not on the CIFAR-10/SVHN datasets.

You shouldn’t pass the softmax into the CrossEntropy loss. It computes log_softmax(y2) internally, so you end up with log_softmax(softmax(z)), which would make for a pretty awkward gradient. That was actually a frequent issue among my students, so I made a kind of cheat sheet for them: Why are there so many ways to compute the Cross Entropy Loss in PyTorch and how do they differ?
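To see what the loss actually computes, here is a tiny sketch with arbitrary values:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
target = torch.randint(0, 10, (4,))

# nn.CrossEntropyLoss / F.cross_entropy == log_softmax followed by nll_loss on the raw logits
loss_a = F.cross_entropy(logits, target)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(loss_a, loss_b))  # True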

@ptrblck, I did try removing the softmax as well. In this case the error decreases but the accuracy does not improve.

That’s good. Now that the loss decreases, the next step is to find out why the accuracy doesn’t increase. Can you show the function that computes the accuracy?


@rasbt, thank you for your reply.

I have enclosed my entire code here. The parameters of my network are initialized using the trained weights of a full-precision neural network with the same architecture.

#load all the necessary libraries
import torch
import torch.nn
import numpy as np
torch.backends.cudnn.deterministic=True
#from utils import plot_images
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler
import torchvision
import torch.nn.functional as F
from torch.distributions import Categorical

device=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
sigm=torch.nn.Sigmoid()


#import the CIFAR-10 datasets

def get_train_valid_loader(data_dir,
                           batch_size,
                           augment,
                           random_seed,
                           valid_size=0.1,
                           shuffle=True,
                           show_sample=False,
                           num_workers=4,
                           pin_memory=False):
    error_msg = "[!] valid_size should be in the range [0, 1]."
    assert ((valid_size >= 0) and (valid_size <= 1)), error_msg

    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )

    # define transforms
    valid_transform = transforms.Compose([
            transforms.ToTensor(),
            normalize,
    ])
    if augment:
        train_transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])
    else:
        train_transform = transforms.Compose([
            transforms.ToTensor(),
            normalize,
        ])

    # load the dataset
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=train_transform,
    )

    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=valid_transform,
    )

    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )

    # visualize some images
    if show_sample:
        sample_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=9, shuffle=shuffle,
            num_workers=num_workers, pin_memory=pin_memory,
        )
        data_iter = iter(sample_loader)
        images, labels = next(data_iter)
        X = images.numpy().transpose([0, 2, 3, 1])
        plot_images(X, labels)

    return (train_loader, valid_loader)

train_loader,valid_loader=get_train_valid_loader(data_dir='C://Users//AEON-LAB PC//.spyder-py3//CIFAR_10',
                           batch_size=128,
                           augment=True,
                           random_seed=999,
                           valid_size=0.2,
                           shuffle=True,
                           show_sample=False,
                           num_workers=1,
                           pin_memory=True)

#Class that performs the convolution operation according to the reparameterization trick (Shayer et al.)
class Repcnn(torch.nn.Module):
  def __init__(self,wfp):
    super(Repcnn,self).__init__()
    self.a,self.b=self.initialize(wfp)
    #print(self.a.norm())
  def initialize(self,wfp):
    wtilde=wfp/torch.std(wfp)
    sigma_a=0.95-((0.95-0.05)*torch.abs(wtilde))
    sigma_b=0.5*(1+(wfp/(1-sigma_a)))
    sigma_a=torch.clamp(sigma_a,0.05,0.95)
    sigma_b=torch.clamp(sigma_b,0.05,0.95)
    a=torch.log(sigma_a/(1-sigma_a)).requires_grad_().cuda()
    b=torch.log(sigma_b/(1-sigma_b)).requires_grad_().cuda()
    return torch.nn.Parameter(a),torch.nn.Parameter(b)
  
  def forward(self,x):
    
    weight_m= (2*sigm(self.b)-(2*sigm(self.a)*sigm(self.b))-1+sigm(self.a))
    #print(self.a.norm())
    weight_v=(1-sigm(self.a))-weight_m**2
    assert torch.all(weight_v>=0)
    om=F.conv2d(x,weight_m,padding=1)
    ov=F.conv2d(x**2,weight_v,padding=1)
    assert torch.all(ov>=0)
    #e=torch.randn_like(ov).cuda()
    e=torch.randn_like(ov).cuda()
    z=om+(ov*e)
    return z
  
#Class that performs the linear operation in the fully connected layers using the reparameterization trick
class Repfc(torch.nn.Module):
  def __init__(self,wfp):
    super(Repfc,self).__init__()
    self.a1,self.b1=self.initialize(wfp)
  def initialize(self,wfp):
    
    wtilde=wfp/torch.std(wfp)
    sigma_a=0.95-((0.95-0.05)*torch.abs(wtilde))
    sigma_b=0.5*(1+(wfp/(1-sigma_a)))
    sigma_a=torch.clamp(sigma_a,0.05,0.95)
    sigma_b=torch.clamp(sigma_b,0.05,0.95)
    a=torch.log(sigma_a/(1-sigma_a))
    b=torch.log(sigma_b/(1-sigma_b))
    return torch.nn.Parameter(a),torch.nn.Parameter(b) 
  
  
  def forward(self,x):
    
    weight_m=(2*sigm(self.b1)-(2*sigm(self.a1)*sigm(self.b1))-1+sigm(self.a1))
    weight_v=(1-sigm(self.a1))-weight_m**2
    om=torch.matmul(weight_m,x)
    ov=torch.matmul(weight_v,x**2)
    #e=torch.randn_like(ov).cuda()
    e=torch.randn_like(ov).cuda()
    z=om+(ov*e)
    
    return z
 
#Weight initialization using the trained full-precision network
model=torch.load('/content/vgg_8_without_relu.pth',map_location='cpu')
wfp=[]
wfp.append(model['features.0.weight'])
wfp.append(model['features.3.weight'])
wfp.append(model['features.7.weight'])
wfp.append(model['features.10.weight'])
wfp.append(model['features.14.weight'])
wfp.append(model['features.17.weight'])
wfp.append(model['classifier.1.weight'])
wfp.append(model['classifier.2.weight'])


for i in range(len(wfp)):
  wfp[i]=torch.Tensor(wfp[i])
  
 #Forward propagation and training
class Conv_Net(torch.nn.Module):
  def __init__(self,wfp):
    super(Conv_Net,self).__init__()
    self.hidden=torch.nn.ModuleList([])
    self.batchnorm=torch.nn.ModuleList([])
    for i in range(6):
      cnn=Repcnn(wfp[i])
      self.hidden.append(cnn)
    for j in range(2):
      fc=Repfc(wfp[i+1])
      i+=1
      self.hidden.append(fc)
    batch_dim=[128,256,512]
    for i in batch_dim:
      self.batchnorm.append(torch.nn.BatchNorm2d(i))
    self.mp=torch.nn.MaxPool2d(kernel_size=2,stride=2)
    self.drop=torch.nn.Dropout()
    self.activation=torch.nn.ReLU()
  def forward(self,x):
    op=x
    j=0
    while(j<6):
      obj=self.hidden[j]
      obj_next=self.hidden[j+1]
      b=self.batchnorm[j//2]
      j+=2
      op=self.mp(self.activation(b(obj_next(self.activation(b(obj(op)))))))
    op=op.view(op.size(0),-1)
    op=torch.t(op)
    obj=self.hidden[j]
    op=obj(self.drop(op))
    j+=1
    obj=self.hidden[j]
    yout=obj(op)
    yout=torch.t(yout)
    #print(yout)
    return yout
  
net=Conv_Net(wfp).to(device)

def l2_reg():
  sum=0
  for p in net.parameters():
    sum+=p.norm(2)
  return sum

l_rate=0.01
#lr_decay=20
beta_param=1e-11
weight_decay=1e-11
optimizer=torch.optim.Adam(net.parameters(),lr=l_rate,weight_decay=weight_decay)
criterion=torch.nn.CrossEntropyLoss().cuda()
net.train()
num_epochs=300
for epoch in range(num_epochs):
  if(epoch==170):
    lr=0.001
    for param_group in optimizer.param_groups:
      param_group['lr']=lr
  for i,(images,labels) in enumerate(train_loader):
    images=images.to(device)
    labels=labels.to(device)
    #print(i)
    #torch.cuda.empty_cache()
    optimizer.zero_grad()
    yout=net(images)
    loss_batch=criterion(yout,labels)+(beta_param*l2_reg())
    loss_batch.backward()
    optimizer.step()
    
  print('epoch {}'.format(epoch),'loss {}'.format(loss_batch.item()))
  sum_grad=0
  for p in net.parameters():
    sum_grad+=p.grad.norm()
  print('sum of the gradients of all parameters in epoch{} is {}'.format(epoch,sum_grad))
#evaluation 
net.eval()
with torch.no_grad():
  correct=0
  total=0
  for images,labels in valid_loader:
    images=images.to(device)
    labels=labels.to(device)
    yout=net(images)
    _,predicted=torch.max(yout,1)
    total+=labels.size(0)
    correct+=(predicted==labels).sum().item()
  print('Test accuracy of the model on the 10000 test images:{}%'.format((correct/total)*100))

The problem here is that the loss decreases until 2.30258 and then stays constant.

If I’m reading it right you’re only dropping your learning rate once during the training. You should consider using a function of some sort to decrease the learning rate as your training progresses. My first guess would be that the LR you’re using is too high at the end to make meaningful progress beyond a certain point.
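For example, instead of the single manual drop at epoch 170, something like this could replace it (a rough sketch; the milestones and gamma are placeholders you would want to tune):

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(num_epochs):
    # ... the inner loop over train_loader stays exactly as you have it ...
    scheduler.step()  # decay the learning rate after every epoch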

@achaiah, thanks a lot for that suggestion. Any more suggestions from your end? What else could I potentially do to make the loss decrease further?

Well, as I said, I’d try that first. I’ve definitely run into situations where LR was preventing my networks from learning further. If that is indeed the case in your situation, nothing else you do will improve the loss because you just keep bouncing around the local minimum but never approach it.

As another trick, I’d start with SGD before moving on to Adam. In my own applications Adam converges faster but SGD still generalizes better in the end (by a few percent).
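For instance, as a drop-in replacement for the Adam line (the momentum and weight decay values here are just common starting points, not tuned for your model):

optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)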


Okay, thank you for the suggestions. Will try them out.

@srikanthram
I got the same problem with the same loss of 2.30x.
How did you solve the problem?

Thanks for your comment @ptrblck. This helped me fix my code.

I was going through the same problem and thought of playing with different loss functions,
but finally changing Adam to SGD and a small change in the learning rate worked for me.
Thank you!


Glad you found this useful. In my own work I’ve never found Adam actually outperforming SGD but many people use Adam / AdamW for their training so I’m not sure if there are any more tricks that I’m missing.

I have run into the exact same problem as you did, and I ended up solving it after I thoroughly figured it out.

Though you might not need my help anymore, I think I’d better write it down so that people running into the same problem in the future can use my answer.

I’m going to put the solutions at the top and then explain later in my post why this “loss not decreasing” issue occurs so often and what it actually is. If you just want the solution, check the following few lines.

SOLUTIONS:

  1. Check if you pass the softmax into the CrossEntropy loss. If you do, correct it. For more information, check @rasbt’s answer above.
  2. Use a smaller learning rate in the optimizer, or add a learning rate scheduler which will decrease the learning rate automatically during training.
  3. Use SGD optimizer instead of Adam.

EXPLANATION:

In the above piece of code, when I print my loss it does not decrease at all. It always stays roughly the
same, around 2.30:
epoch 0 loss = 2.308579206466675
epoch 1 loss = 2.297269344329834
epoch 2 loss = 2.3083386421203613
epoch 3 loss = 2.3027005195617676
epoch 4 loss = 2.304455518722534
epoch 5 loss = 2.305694341659546

What is this error? Why did the loss stop decreasing and go slightly up and down in a small range?

It is not difficult to see that the model has gotten stuck in a local minimum, and a pretty poor one, since the performance of the model is basically equal to guessing randomly (actually, precisely equal to guessing randomly). It is also a common one, since so many people have run into the same minimum as you and I did.
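By the way, you can verify the “precisely equal to guessing randomly” part: with 10 classes, a uniform prediction gives exactly the loss value everyone is seeing.

import math
# cross entropy of a uniform prediction over 10 classes
print(-math.log(1 / 10))   # 2.302585092994046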

So I checked the parameters of a model that had gotten stuck in this minimum, and it turns out that they are almost all negative and their gradients are all zero!

# print every parameter tensor together with its gradient to see what the optimizer is (not) updating
for i, para in enumerate(model.parameters()):
    print(f'{i + 1}th parameter tensor:', para.shape)
    print(para)
    print(para.grad)

What’s actually happening: the negative parameters and the ReLU activation together cause the outputs of the middle layers to be all zeros, which means the parameters’ gradients become zero too (think about the chain rule). That’s why the parameters stop updating and the output is just a random guess. In this case, the bias of the last fully connected layer is the only parameter tensor that can have a “normal” gradient (i.e. not all zeros), so it gets updated every batch, causing the small ups and downs.

Therefore, one of the main solutions to this problem is to use a smaller learning rate so that our parameters don’t easily end up in such an “extreme” or “remote” area of the parameter space.
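If you want to check whether this is happening in your own model, here is a rough diagnostic sketch (it reuses the net, train_loader and device from the code posted above; the hook just reports how many activations come out exactly zero):

def make_hook(name):
    def hook(module, inp, out):
        print(name, 'fraction of zero activations:', (out == 0).float().mean().item())
    return hook

# attach a forward hook to every ReLU module in the network
for name, module in net.named_modules():
    if isinstance(module, torch.nn.ReLU):
        module.register_forward_hook(make_hook(name))

# run a single batch and read the printed fractions
images, labels = next(iter(train_loader))
yout = net(images.to(device))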


I am trying to use softmax on a sequence-to-sequence problem. I have targets of two types: the first has 13 tokens and the second has 100. The model has two outputs, and I compute a loss for each type and sum them. The model output has shape [seq_length, number of tokens, embedding dim] and the target has shape [seq_length, number of tokens]. For nll_loss I apply log softmax along dimension 1 and then pass the output to the loss function after transposing the last two dimensions, i.e. [seq_length, number of tokens, embedding dim]. For cross entropy I pass the raw logits of shape [seq_length, number of tokens]. But in each case the loss just fluctuates and does not decrease.

A short description of the full process is as follows.
The inputs are a sequence of tuples like [(1,15), (2,27), (13,10)]. First I one-hot encode each token in each tuple, then embed them so that the first elements of each tuple have shape [10, embed dim], then add the two tensors of each tuple to create one embedding per tuple, and finally stack these to create an input of shape [3, 110, embed dim]. The two output logits have shapes [3, 10, embedding dim] and [3, 100, embedding dim]. If the target is [(2,27), (12,10), (5, 90)], the elements are one-hot encoded separately, so target 1 has shape [3, 10] and target 2 has shape [3, 100].

What am I doing wrong here? Any ideas?