Hello everyone.
I have been stuck on this problem for 2 days now.
I wrote my model, loaded the data, and started training, but the accumulated loss is not changing and neither is the accuracy. I figured out that what is supposed to happen after optimizer.step() is not actually happening.
I would be very grateful if anyone could propose a solution.
It feels like there could be many sources of error. Could you share your code?
Which part of it exactly?
Enough code that it’s possible to run it and see if I can find any bug. If the code isn’t working, I would try to break it down into simpler parts to locate the error. How did you come to the conclusion that it’s optimizer.step() that’s causing it?
Edit: Or could you just post the code for the training part of the network? That might be enough.
I tried printing the loss before and after optimizer.step() and it didn’t change.
As for the code, here it is:
optimizer = optim.Adam(net.parameters(), lr=0.1)
net = conv_Net()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True
loss_function=nn.CrossEntropyLoss()
for batch in trainloader:
    images, labels = batch
    images=images.to(device)
    labels=labels.to(device)
    optimizer.zero_grad()
    preds = net(images)
    loss = loss_function(preds,labels)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()
    total_correct += get_correct(preds, labels)
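A note on the check itself: the `loss` tensor is computed before optimizer.step(), so printing that same tensor again after the step will always show the same value, whether or not the weights moved. A minimal sketch of a more direct check is below. It snapshots the parameters, runs one step, then compares the parameters and recomputes the loss. The tiny `nn.Linear` model and the random data are hypothetical stand-ins, not the model from this thread.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in model and data, used only for illustration.
model = nn.Linear(4, 3)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_function = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

# Snapshot the parameters before the update.
before = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
loss = loss_function(model(inputs), targets)
loss.backward()
optimizer.step()

# Did any parameter tensor change during the step?
params_changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)

# Recompute the loss with the updated weights instead of reusing `loss`.
with torch.no_grad():
    loss_after = loss_function(model(inputs), targets)

print(params_changed, loss.item(), loss_after.item())
```

If `params_changed` is False in a real training loop, the optimizer is probably not holding the parameters that the forward pass uses, or the gradients are missing.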
It might be that the learning rate is too high. Have you tried lowering it?
Yes, I did. I tried 0.01, but the problem persists.
I would try lowering it further, to something like 1e-3 or 1e-4, just to make sure. Also, what data are you training on? Have you tried running it on a very simple dataset like MNIST to make sure the problem is in this part of the code?
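A related sanity check that avoids downloading anything is to see whether the training loop can fit one small fixed batch. The sketch below uses random synthetic data in place of MNIST and a small hypothetical MLP, so it only tests the loop mechanics (zero_grad, backward, step), not the real model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical synthetic batch standing in for a simple dataset.
inputs = torch.randn(16, 10)
targets = torch.randint(0, 3, (16,))

# Small stand-in model; any nn.Module could be dropped in here.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_function = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_function(model(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

If the loss on a single repeated batch does not go down at all, the problem is likely in the loop or the model wiring rather than in the dataset or the learning rate.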
OK, I will lower it as you said.
The data is CIFAR-10. No, I didn’t. I was planning on testing other data, but I haven’t yet.
CIFAR-10 should be fine. It could also be the model. Could you share the code for it? The training part looks fine, except that the learning rate looked way too high. I would add another loop going through the epochs if you haven’t done so already; it might take a couple of epochs to see whether it improves substantially.
A lower learning rate didn’t change anything. The model is:
(I am using an epoch loop; I just left it out here for simplicity.)
class conv_Net(nn.Module):
    def __init__(self):
        super(conv_Net,self).__init__()
        self.bn1=nn.BatchNorm2d(3)
        self.Conv1=convNxN(3,64,3)
        self.Conv2=convNxN(64,64,3)
        self.bn2=nn.BatchNorm2d(64)
        self.Conv3=convNxN(64,16,1)
        self.FullC=nn.Linear(in_features=2304,out_features=10)
    def forward(self,x):
        out=activ_Function(self.Conv1(self.bn1(x)))
        out=activ_Function(self.Conv2(out))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=self.bn2(activ_Function(self.Conv3(out)))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=out.view(out.size(0),-1)
        out=self.FullC(out)
        return out
Looking at the training part, you’re defining net = conv_Net() after you’ve already called Adam(net.parameters()), so this would cause an error. The model is a little confusing: what are convNxN and activ_Function? Also, BatchNorm2d is usually placed after the conv layer and before the activation function. I’ve rewritten your code, and the following runs for me:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class conv_Net(nn.Module):
    def __init__(self):
        super(conv_Net,self).__init__()
        self.bn1=nn.BatchNorm2d(64)
        self.bn2=nn.BatchNorm2d(16)
        self.Conv1=nn.Conv2d(3,64,3)
        self.Conv2=nn.Conv2d(64,64,3)
        self.Conv3=nn.Conv2d(64,16,1)
        # 32x32 input -> 28x28 after two unpadded 3x3 convs -> 7x7 after two 2x2 poolings; 16*7*7 = 784
        self.FullC = nn.Linear(in_features=784,out_features=10)
    def forward(self,x):
        out=F.relu(self.bn1(self.Conv1(x)))
        out=F.relu(self.Conv2(out))
        # Positional args: stride=2, padding=0, ceil_mode=False, count_include_pad=True, divisor_override=1
        # (with divisor_override=1 each 2x2 window is summed rather than averaged)
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=F.relu(self.bn2(self.Conv3(out)))
        out=F.avg_pool2d(out,(2,2),2,0,False,True,1)
        out=out.view(out.size(0),-1)
        out=self.FullC(out)
        return out

# Load Data
batch_size=64
train_dataset = datasets.CIFAR10(root='dataset/', train=True, transform=transforms.ToTensor(), download=True)
trainloader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Create the model first, then hand its parameters to the optimizer
net = conv_Net()
optimizer = optim.Adam(net.parameters(), lr=1e-4)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)
loss_function=nn.CrossEntropyLoss()

for epoch in range(10):
    losses = []
    for batch in trainloader:
        images, labels = batch
        images=images.to(device)
        labels=labels.to(device)
        optimizer.zero_grad()
        preds = net(images)
        loss = loss_function(preds,labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    # Mean training loss for this epoch
    print(sum(losses)/len(losses))
EDIT: Changed to use F.relu, as this might be clearer for you.
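On the ordering point about Adam(net.parameters()) and net = conv_Net(): in a fresh script the original order raises a NameError, but in a notebook where an older `net` still exists it can run silently while the optimizer updates the old model's parameters. The sketch below illustrates that failure mode with a hypothetical `nn.Linear` stand-in and a hypothetical helper `train_step`. It is only an illustration of the ordering issue, not a diagnosis of the code in this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

def train_step(model, optimizer):
    """Run one update on random data; return True if model.weight changed."""
    before = model.weight.detach().clone()
    inputs = torch.randn(8, 4)
    targets = torch.randint(0, 3, (8,))
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return not torch.equal(before, model.weight.detach())

# Wrong order: the optimizer is built from an older model object.
net = nn.Linear(4, 3)
optimizer = optim.Adam(net.parameters(), lr=1e-3)
net = nn.Linear(4, 3)  # new parameters that the optimizer never sees
stale_changed = train_step(net, optimizer)

# Right order: build the model, then build the optimizer from it.
net = nn.Linear(4, 3)
optimizer = optim.Adam(net.parameters(), lr=1e-3)
fresh_changed = train_step(net, optimizer)

print(stale_changed, fresh_changed)  # False True
```

In the stale case no error is raised: the old parameters simply have no gradients, so the step is a no-op for the model actually being evaluated.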
Thank you for the suggestion and your time. I will verify all of this and report back to you.
In fact, the instantiation of the model is in the right place in my real code (that was a big mistake of mine when rewriting the code here).
Why are you calling ReLU at the beginning of the model? Also, I didn’t know about the ordering, with batch norm going after the conv layer and before the activation. Thank you so much.
Hopefully you get it to work now. Look at my edited code: I switched to F.relu, as it might be clearer to you. With nn.ReLU(), I was only creating the module once and reusing it multiple times in forward().
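To make the nn.ReLU() versus F.relu point concrete, here is a small sketch showing that the two give the same result and that the module holds no learnable parameters, which is why a single instance can be reused at several places in forward(). The input tensor is an arbitrary example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.0, 2.0]])

relu = nn.ReLU()  # module form, created once and reusable

# Both forms apply the same elementwise max(0, x).
same = torch.equal(relu(x), F.relu(x))

# nn.ReLU has no weights, so reusing one instance shares no state.
num_params = sum(p.numel() for p in relu.parameters())

print(same, num_params)  # True 0
```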
Thank you, man.
Have a good and safe day.