Loss/accuracy is not changing; optimizer.step() is not working

Hello everyone.
I have been stuck on this problem for two days now.
I wrote my model, loaded the data, and started training, but the accumulated loss is not changing and neither is the accuracy. I figured out that what is supposed to happen after optimizer.step() is not actually happening.
I would be very grateful if anyone could propose a solution.

It feels like there could be many sources of error; could you share your code?

Which part of it exactly?

Enough code so it’s possible to run it and see if I can find any bug. If the code isn’t working, I would try to break it down into simpler parts to locate the error. How did you come to the conclusion that it’s optimizer.step() that’s causing it?

Edit: Or just post the code for the training part of the network? That might be enough.
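
For example, a quick way to verify whether optimizer.step() actually updates the weights is to compare a copy of a parameter before and after the step. A minimal sketch (assuming net, optimizer, loss_function and a batch of images/labels already exist):

# Sketch: does optimizer.step() actually change the parameters?
before = [p.detach().clone() for p in net.parameters()]

optimizer.zero_grad()
loss = loss_function(net(images), labels)
loss.backward()
optimizer.step()

changed = any((b != p.detach()).any() for b, p in zip(before, net.parameters()))
print("parameters changed after step():", changed)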

I tried printing the loss before and after optimizer.step() and it didn’t change.

As for the code, here it is:

optimizer = optim.Adam(net.parameters(), lr=0.1) 
net = conv_Net()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True
loss_function=nn.CrossEntropyLoss()
for batch in trainloader:
        images, labels = batch
        images=images.to(device)
        labels=labels.to(device)
        optimizer.zero_grad() 
        preds = net(images)
        loss = loss_function(preds,labels)
        loss.backward() 
        optimizer.step() 
        total_loss += loss.item()
        total_correct += get_correct(preds, labels)

It might be that the learning rate is too high; have you tried lowering it?

Yes, I did. I tried 0.01 but the problem persists.

I would try lowering it more, e.g. 1e-3 or 1e-4, just to make sure. Also, what data are you training on? Have you tried running it on a very simple dataset like MNIST to make sure the problem really is in this part of the code?
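
For example, swapping in MNIST is just a different dataset class in torchvision; a minimal sketch (note MNIST is 1-channel 28x28, so the first conv layer and the linear layer’s in_features would need adjusting):

# Sketch: swap in MNIST as a sanity check for the training loop.
# MNIST images are 1x28x28, so the model's first conv needs in_channels=1.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

mnist_train = datasets.MNIST(root='dataset/', train=True,
                             transform=transforms.ToTensor(), download=True)
trainloader = DataLoader(mnist_train, batch_size=64, shuffle=True)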

OK, I will lower it as you said.
The data is CIFAR-10. No, I haven’t; I was planning on testing other data but I didn’t get to it.

CIFAR-10 should be fine. It could also be the model; could you share the code for it? The training part looks fine, except the learning rate looked way too high. I would also add an outer loop over the epochs if you haven’t done that already; it might take a couple of epochs to see whether the loss improves substantially.

A lower learning rate didn’t do anything. The model is:
(I am using the epoch loop; I just didn’t write it here for simplicity.)

class conv_Net(nn.Module):
    def __init__(self):
        super(conv_Net,self).__init__()
        self.bn1=nn.BatchNorm2d(3)
        self.Conv1=convNxN(3,64,3)
        self.Conv2=convNxN(64,64,3)
        self.bn2=nn.BatchNorm2d(64)
        self.Conv3=convNxN(64,16,1)
        self.FullC=nn.Linear(in_features=2304,out_features=10)
    def forward(self,x):
        out=activ_Function(self.Conv1(self.bn1(x)))
        out=activ_Function(self.Conv2(out))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=self.bn2(activ_Function(self.Conv3(out)))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=out.view(out.size(0),-1)
        out=self.FullC(out)
        return out

Looking at the training part, you’re defining net = conv_Net() after you’ve already passed net.parameters() to Adam, so this would either throw an error or, if an older net already existed, leave the optimizer updating the parameters of a model you’re no longer training. The model is a little confusing: what are convNxN and activ_Function? Also, BatchNorm2d should usually be used after the conv layer and before the activation function. I’ve rewritten your code and the following runs for me:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class conv_Net(nn.Module):
    def __init__(self):
        super(conv_Net,self).__init__()
        self.bn1=nn.BatchNorm2d(64)
        self.bn2=nn.BatchNorm2d(16)
        self.Conv1=nn.Conv2d(3,64,3)
        self.Conv2=nn.Conv2d(64,64,3)
        self.Conv3=nn.Conv2d(64,16,1)
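        # With 32x32 CIFAR-10 input: conv3x3 -> 30, conv3x3 -> 28, pool -> 14, conv1x1 -> 14, pool -> 7, so 16*7*7 = 784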
        self.FullC = nn.Linear(in_features=784,out_features=10)
        
    def forward(self,x):
        out=F.relu(self.bn1(self.Conv1(x)))
        out=F.relu(self.Conv2(out))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=F.relu(self.bn2(self.Conv3(out)))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=out.view(out.size(0),-1)
        out=self.FullC(out)
        return out
    
# Load Data
batch_size=64
train_dataset = datasets.CIFAR10(root='dataset/', train=True, transform=transforms.ToTensor(), download=True)
trainloader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

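# Create the model first, then hand its parameters to the optimizer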
net = conv_Net()
optimizer = optim.Adam(net.parameters(), lr=1e-4) 

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)
loss_function=nn.CrossEntropyLoss()

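# Train for a few epochs and print the average loss per epoch; it should decrease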
for epoch in range(10):
    losses = []
    for batch in trainloader:
        images, labels = batch
        images=images.to(device)
        labels=labels.to(device)
        optimizer.zero_grad() 
        preds = net(images)
        loss = loss_function(preds,labels)
        loss.backward() 
        optimizer.step() 
        losses.append(loss.item())
  
    print(sum(losses)/len(losses))

EDIT: Changed the code to use F.relu, as this might be clearer for you.

Thank you for the suggestion and your time. I will verify all of this and report back to you.
In fact, the instantiation of the model is in the right place in my real code (that was a huge mistake of mine when rewriting the code here).
Why were you calling ReLU at the beginning of the model? And I didn’t know about the ordering, i.e. batch norm after the conv and before the activation. Thank you so much.

Hopefully you get it working now. Look at my edited code: I changed it to use F.relu instead, as it might be clearer to you. With nn.ReLU() I was only creating the module once in __init__ and then calling it multiple times in forward().
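
For reference, the two styles do the same thing; a minimal illustration (hypothetical class names, just to show the difference):

import torch.nn as nn
import torch.nn.functional as F

class ModuleStyle(nn.Module):
    def __init__(self):
        super().__init__()
        self.relu = nn.ReLU()      # create the module once here...

    def forward(self, x):
        return self.relu(x)        # ...and call it (possibly several times) in forward()

class FunctionalStyle(nn.Module):
    def forward(self, x):
        return F.relu(x)           # purely functional, nothing to create in __init__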

Thank you, man.
Have a good and safe day :smiley: