Cannot train a simple CNN from the tutorial on the GPU

Here is the network configuration

import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net().to(device)

And here is the training code

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        # print('passed')
        optimizer.step()


        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

And here is the error message

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-6350afc1b67e> in <module>()
     17         loss.backward()
     18         print('pass')
---> 19         optimizer.step()
     20 
     21 

/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py in step(self, closure)
     99                     else:
    100                         buf = param_state['momentum_buffer']
--> 101                         buf.mul_(momentum).add_(1 - dampening, d_p)
    102                     if nesterov:
    103                         d_p = d_p.add(momentum, buf)

RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #4 'other'

I’m new to PyTorch, and I just can’t figure out how this can still expect a torch.FloatTensor instead of a torch.cuda.FloatTensor after I called .cuda().

Thanks.

How did you create your optimizer and your loss function?
I think the problem might be that your optimizer was created with parameters that were already on the GPU, which causes an error since its internals are (and are supposed to be) on the CPU. To fix that, you could try creating the optimizer before pushing the model to the GPU.
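
If you want to verify where those internals live, you can inspect the optimizer's state after the first step. A quick sketch (assuming SGD with momentum, which is what your traceback shows):

for param, state in optimizer.state.items():
    if 'momentum_buffer' in state:
        # the buffer should sit on the same device as the parameter it belongs to
        print(param.device, state['momentum_buffer'].device)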

Check the example below and try creating things in that order, i.e. define the optimizer after moving the Net to the GPU.
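
A minimal sketch of that ordering (the loss function and the SGD hyperparameters here are just the tutorial's defaults, used as placeholders):

import torch.optim as optim

net = Net().to(device)  # move the model first ...
criterion = nn.CrossEntropyLoss()
# ... then build the optimizer from the parameters that now live on the GPU
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)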

You can check whether your model is on the GPU via:

next(model.parameters()).is_cuda

Thanks, guys. I created the optimizer before putting the network on the GPU, and this problem is solved.

But I may have two new problems now. Training on the GPU is way slower than on the CPU: on the CPU it takes around 19 seconds per 2000 steps, while on the GPU it takes around 31 seconds per 2000 steps.
Also, the loss of the model is very low from the beginning and it just isn't decreasing. Did I do something wrong?

Here is the network configuration

class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()        
        # self.conv1 = nn.Conv2d(1, 6, 5) # for single-channel images
        self.conv1 = nn.Conv2d(3, 6, 5) # for RGB images
        # self.conv3 = nn.Conv2d(6, 16, 5)        
        self.conv3_0 = nn.Conv2d(3, 1, 5)
        self.conv3_1 = nn.Conv2d(4, 1, 5)
        self.conv3_2 = nn.Conv2d(6, 1, 5)        
        self.conv5 = nn.Conv2d(16, 120, 5)        
        self.fc6 = nn.Linear(120, 84)
        self.fc7 = nn.Linear(84, 10)
    
    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x_chunks = torch.chunk(x, 6, dim=1)  # split the 6 feature maps into single channels
        chunk_batchs = []
        # six 3-channel combinations of the feature maps
        chunk_batchs.append(torch.cat(x_chunks[0: 3], dim=1))
        chunk_batchs.append(torch.cat(x_chunks[1: 4], dim=1))
        chunk_batchs.append(torch.cat(x_chunks[2: 5], dim=1))
        chunk_batchs.append(torch.cat(x_chunks[3: ], dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[4], x_chunks[5]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[1], x_chunks[5]), dim=1))        
        # nine 4-channel combinations
        chunk_batchs.append(torch.cat(x_chunks[0: 4], dim=1))
        chunk_batchs.append(torch.cat(x_chunks[1: 5], dim=1))
        chunk_batchs.append(torch.cat(x_chunks[2: ], dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[3], x_chunks[4], x_chunks[5]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[1], x_chunks[4], x_chunks[5]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[1], x_chunks[2], x_chunks[5]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[0], x_chunks[1], x_chunks[3], x_chunks[4]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[1], x_chunks[2], x_chunks[4], x_chunks[5]), dim=1))
        chunk_batchs.append(torch.cat((x_chunks[1], x_chunks[2], x_chunks[3], x_chunks[5]), dim=1))

        out = []
        # one output map per combination, plus one map that sees all six channels
        for t_batch in chunk_batchs[0: 6]:
            out.append(self.conv3_0(t_batch))  # 3 -> 1 channels
        for f_batch in chunk_batchs[6: ]:
            out.append(self.conv3_1(f_batch))  # 4 -> 1 channels
        out.append(self.conv3_2(x))            # 6 -> 1 channels
        x = torch.cat(out, dim=1)              # 16 feature maps in total
      
        x = F.max_pool2d(x, 2)        
        x = self.conv5(x)        
        x = x.view(-1, 120)  # flatten to (batch, 120)
        x = F.tanh(self.fc6(x))        
        x = self.fc7(x)

        return x 
  
net = Net()

And here is my training code

for epoch in range(2):
    running_loss = 0.0
    s_time = time.clock()
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        labels = labels.view(bs, 1)
        # build one-hot targets for the MSE loss
        onehot_label = torch.FloatTensor(bs, 10)
        onehot_label.zero_()
        onehot_label.scatter_(1, labels, 1)
        
        # will be removed in the CPU version
        inputs = inputs.to(device)
        onehot_label = onehot_label.to(device)
        
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = MSE(outputs, onehot_label)
        loss.backward()
        optimizer.step()        
        running_loss += loss.item()
        if i % 2000 == 1999:
            r_time = time.clock() - s_time
            print('[%d, %5d] loss: %.3f time: %.3f'%
                  (epoch + 1, i + 1, running_loss / 2000, r_time))
            s_time = time.clock()
            running_loss = 0.0

print('Finished Training')

Here is the training output on the CPU

[1,  2000] loss: 0.088 time: 20.982
[1,  4000] loss: 0.085 time: 20.645
[1,  6000] loss: 0.083 time: 20.652
[1,  8000] loss: 0.081 time: 19.481
[1, 10000] loss: 0.080 time: 19.409
[1, 12000] loss: 0.080 time: 19.292
[2,  2000] loss: 0.079 time: 19.401
[2,  4000] loss: 0.079 time: 19.402
[2,  6000] loss: 0.078 time: 19.422
[2,  8000] loss: 0.078 time: 19.493
[2, 10000] loss: 0.077 time: 20.072
[2, 12000] loss: 0.077 time: 19.498
Finished Training

Here is the GPU training output

[1,  2000] loss: 0.076 time: 30.982
[1,  4000] loss: 0.077 time: 36.717
[1,  6000] loss: 0.077 time: 31.032
[1,  8000] loss: 0.077 time: 31.602
[1, 10000] loss: 0.076 time: 48.607
[1, 12000] loss: 0.076 time: 38.124
[2,  2000] loss: 0.075 time: 30.924
[2,  4000] loss: 0.075 time: 34.561
[2,  6000] loss: 0.075 time: 30.950
[2,  8000] loss: 0.075 time: 30.780
[2, 10000] loss: 0.074 time: 30.815
[2, 12000] loss: 0.074 time: 30.876
Finished Training

Here is the GPU status while training:
[screenshot of GPU utilization omitted]

You have too many sequential operations (the appends), which are not parallelizable on the GPU; the CPU is faster at sequential computation. You should be able to do all the appends with indexing instead, and then the GPU will be faster.

Could you tell me how to convert the appends to indexing?

And why is the loss so low from the beginning? I’m really confused about that.
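
One way to replace the appends is advanced indexing on the channel dimension: precompute the channel-subset indices once, gather all subsets into the batch dimension, and call each conv a single time. Here is a sketch of a hypothetical helper, sparse_conv3, that would replace the chunk/append block in your forward; it assumes x is (N, 6, H, W) at that point, and the index lists mirror your chunk combinations:

# channel subsets, in the same order as the chunk_batchs appends above
idx3 = torch.tensor([[0, 1, 2], [1, 2, 3], [2, 3, 4],
                     [3, 4, 5], [0, 4, 5], [0, 1, 5]])
idx4 = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5],
                     [0, 3, 4, 5], [0, 1, 4, 5], [0, 1, 2, 5],
                     [0, 1, 3, 4], [1, 2, 4, 5], [1, 2, 3, 5]])

def sparse_conv3(self, x):
    n, _, h, w = x.shape
    # gather all 3-channel subsets at once: (N, 6, 3, H, W) -> (6N, 3, H, W)
    g3 = x[:, idx3.to(x.device)].reshape(n * 6, 3, h, w)
    o3 = self.conv3_0(g3)                      # one conv call instead of six
    o3 = o3.reshape(n, 6, *o3.shape[-2:])
    # same for the 4-channel subsets: (N, 9, 4, H, W) -> (9N, 4, H, W)
    g4 = x[:, idx4.to(x.device)].reshape(n * 9, 4, h, w)
    o4 = self.conv3_1(g4)
    o4 = o4.reshape(n, 9, *o4.shape[-2:])
    o6 = self.conv3_2(x)                       # the last map sees all six channels
    return torch.cat([o3, o4, o6], dim=1)      # (N, 16, H', W')

As for the loss: with MSELoss against one-hot targets, nine of the ten target entries are zero, so a network that outputs all zeros already gets a mean squared error of about 0.1. Starting near 0.09 and creeping down slowly is exactly what that loss looks like at this scale.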