I don't know why my model's loss will not decrease on GPU

I am a new user of PyTorch. While studying, I found that my model trains fine on the CPU, but when I try to train it on the GPU it doesn't work, and I don't know why… Can anybody help me?
Here is the code:

import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net,self).__init__()
        self.conv1=nn.Conv2d(3,6,5)
        self.conv2=nn.Conv2d(6,16,5)
        self.fc1=nn.Linear(16*5*5,120)
        self.fc2=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)
    
    def forward(self,x):
        x=F.max_pool2d(F.relu(self.conv1(x)),(2,2))
        x=F.max_pool2d(F.relu(self.conv2(x)),2)
        x=x.view(x.size()[0],-1)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc2(x))
        x=self.fc3(x)
        return x
net=Net()
print(net)
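
The training loop below also relies on a trainloader, a criterion, and an optimizer that this post does not show. A minimal sketch of that setup, assuming the usual CIFAR-10 tutorial defaults of batch size 4, cross-entropy loss, and SGD with lr=0.001 and momentum=0.9 (the post's actual values are not shown), could look like this:

import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# CIFAR-10 training data; normalization and batch size are assumptions, not taken from the post
transform=transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))])
trainset=torchvision.datasets.CIFAR10(root='./data',train=True,download=True,transform=transform)
trainloader=DataLoader(trainset,batch_size=4,shuffle=True,num_workers=2)

criterion=nn.CrossEntropyLoss()
# assumed optimizer settings; the post only says SGD was used before switching to Adam
optimizer=optim.SGD(net.parameters(),lr=0.001,momentum=0.9)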

for epoch in range(2):
    running_loss=0.0
    for i,data in enumerate(trainloader,0):
        inputs,labels=data
        inputs,labels=Variable(inputs),Variable(labels)
        
        optimizer.zero_grad()
        
        outputs=net(inputs)
        loss=criterion(outputs,labels)
        loss.backward()
        optimizer.step()
        
        running_loss+=loss.item()
        if i%2000==1999:
            print('[%5d %5d] loss: %.3f'%(epoch+1,i+1,running_loss/2000))
            running_loss=0.0
print('Finished Training')

Training works on the CPU, and the results are:

[    1  2000] loss: 2.198
[    1  4000] loss: 1.879
[    1  6000] loss: 1.690
[    1  8000] loss: 1.579
[    1 10000] loss: 1.514
[    1 12000] loss: 1.476
[    2  2000] loss: 1.405
[    2  4000] loss: 1.384
[    2  6000] loss: 1.340
[    2  8000] loss: 1.352
[    2 10000] loss: 1.296
[    2 12000] loss: 1.302
Finished Training

But when training on the GPU:

device=t.device("cuda:0" if t.cuda.is_available() else "cpu")
net=net.to(device)
for epoch in range(2):
    running_loss=0.0
    for i,data in enumerate(trainloader,0):
        inputs,labels=data
        inputs, labels= inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        outputs=net(inputs)
        loss=criterion(outputs,labels)
        loss.backward()
        optimizer.step()
        
        running_loss+=loss.item()
        if i%2000==1999:
            print('[%5d %5d] loss: %.3f'%(epoch+1,i+1,running_loss/2000))
            running_loss=0.0
print('Finished Training')

It doesn't work, and the results are:

[    1  2000] loss: 2.304
[    1  4000] loss: 2.305
[    1  6000] loss: 2.306
[    1  8000] loss: 2.304
[    1 10000] loss: 2.304
[    1 12000] loss: 2.304
[    2  2000] loss: 2.305
[    2  4000] loss: 2.305
[    2  6000] loss: 2.305
[    2  8000] loss: 2.305
[    2 10000] loss: 2.303
[    2 12000] loss: 2.305
Finished Training

And I don't know why.

It seems that in your CPU code you use Variable(), but you don't use it in your GPU code. This might be the source of the problem.

Which version of PyTorch do you use? Variable() has been deprecated since v0.4.
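
Since v0.4 a plain tensor carries its own autograd state, so the wrapper can simply be dropped. A minimal sketch of the same step without Variable (assuming the same trainloader and net as above):

# pre-0.4 style (still accepted, but a no-op wrapper since v0.4):
# inputs,labels=Variable(inputs),Variable(labels)

# v0.4 and later: tensors from the DataLoader go straight into the model
outputs=net(inputs)
loss=criterion(outputs,labels)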

Thank you! I will try.

My version is 1.0.0

@phan_phan Why do we need gradients for inputs and labels?

@mailcorahul Variable did not necessarily mean requires_grad=True.
@shazhongcheng Since you are on v1.0.0, I suggest you get rid of Variable in the CPU code as well.

I realize I didn't answer your main question at all, which was about your GPU code… sorry.

In your GPU code, you have net=net.to(device).

Does this mean you defined net earlier?
You could try defining the optimizer on net.parameters() after calling net=net.to(device).

If this still doesn't work, show us your whole code, so that we can see what's what.
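
For what it's worth, the torch.optim documentation recommends moving a model to its target device before constructing the optimizer, so that the optimizer holds references to the parameters that will actually be updated. A minimal sketch of that ordering (the SGD settings here are assumptions, not taken from this thread):

device=t.device("cuda:0" if t.cuda.is_available() else "cpu")
net=Net().to(device)                                          # move the parameters first
optimizer=optim.SGD(net.parameters(),lr=0.001,momentum=0.9)   # then build the optimizer on them

# quick sanity check that the parameters really live on the GPU
print(next(net.parameters()).device)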


Okay.

Yes, I tried doing this. It seems to work fine both ways (which is not intuitive).
I guess it still isn't the cause.


Thank you! You are really kind to reply to me! Although I tried using .cuda() in place of Variable and .to(device), it didn't work. But when I changed my optimizer from SGD to Adam, the loss decreased! I am really, really happy!
But what makes me happiest is that you replied to me. It makes me feel warm.
Finally, thank you! I finished it!
I suspect the learning rate may have been too small, but the same settings work on the CPU… so strange.
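
(Side note: the stuck GPU loss of about 2.303 is ln(10), the cross-entropy of a uniform guess over the 10 CIFAR-10 classes, so the predictions were not moving at all. A purely hypothetical check, not something run in this thread, would be to keep SGD but add momentum or a larger learning rate and see whether the loss starts to drop:)

# hypothetical alternatives for the same training loop; the values are assumptions
optimizer=optim.SGD(net.parameters(),lr=0.001,momentum=0.9)   # SGD with momentum
# optimizer=optim.SGD(net.parameters(),lr=0.01)               # or simply a larger step size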
My code (I ran it in Jupyter):

import torch as t
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net,self).__init__()
        self.conv1=nn.Conv2d(3,6,5)
        self.conv2=nn.Conv2d(6,16,5)
        self.fc1=nn.Linear(16*5*5,120)
        self.fc2=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)
    
    def forward(self,x):
        x=F.max_pool2d(F.relu(self.conv1(x)),(2,2))
        x=F.max_pool2d(F.relu(self.conv2(x)),2)
        x=x.view(x.size()[0],-1)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc2(x))
        x=self.fc3(x)
        return x
net=Net()

if t.cuda.is_available():
    net=net.cuda()

from torch import optim
criterion=nn.CrossEntropyLoss()
optimizer=optim.Adam(net.parameters(),lr=0.001)

# trainloader / testloader are the CIFAR-10 DataLoaders defined earlier in the notebook (not shown here)
for epoch in range(2):
    running_loss=0.0
    for i,data in enumerate(trainloader,0):
        inputs,labels=data
        if t.cuda.is_available():
            inputs=inputs.cuda()
            labels=labels.cuda()
        
        outputs=net(inputs)
        loss=criterion(outputs,labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        running_loss+=loss.item()
        if (i+1)%2000==1999:
            print('[%5d %5d] loss: %.3f'%(epoch+1,i+1,running_loss/2000))
            running_loss=0.0
print('Finished Training')

correct=0
total=0
for data in testloader:
    images,labels=data
    with t.no_grad():
        outputs=net(images.cuda()).cpu()
    _,predicted=t.max(outputs.data,1)
    total+=labels.size(0)
    correct+=(predicted==labels).sum()
print('test acc:%d %%'%(100*correct/total))

And the results:

[    1  1999] loss: 1.874
[    1  3999] loss: 1.605
[    1  5999] loss: 1.514
[    1  7999] loss: 1.458
[    1  9999] loss: 1.422
[    1 11999] loss: 1.400
[    2  1999] loss: 1.316
[    2  3999] loss: 1.298
[    2  5999] loss: 1.280
[    2  7999] loss: 1.283
[    2  9999] loss: 1.274
[    2 11999] loss: 1.242
Finished Training

test acc:57 %

Thank you! I have finished it! I have posted my solution above.
Again, really, thank you!

Are you saying that the same code with SGD instead of the Adam optimizer doesn't work?

Thanks. Defining the optimizer after net.to(device) works.