Cuda runtime error (4) when pushing model to cuda

gerazov · January 19, 2018, 3:02pm

I’m just starting with PyTorch (coming from Theano) and it’s awesome!

I posted this as an issue on GitHub (https://github.com/pytorch/pytorch/issues/4742), but maybe it’s just me …

I put together a linear regressor with a single hidden layer. It works fine on the CPU but crashes when I try to push it to CUDA. There may be plenty wrong with the code, but I don’t know where to look. Here’s the model and the line that crashes it:

import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Parameter

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.n_feature = n_feature
        self.n_hidden = n_hidden
        self.n_output = n_output
        
        self.wh = Parameter(torch.Tensor(n_feature, n_hidden))
        self.bh = Parameter(torch.Tensor(n_hidden))
        
        self.wy = Parameter(torch.Tensor(n_hidden, n_output))
        self.by = Parameter(torch.Tensor(n_output))
        
        self.reset_parameters()
    
    def reset_parameters(self):
        stdv = 1. / np.sqrt(self.wh.size(1))  
        self.wh.data.uniform_(-stdv, stdv)
        self.bh.data.uniform_(-stdv, stdv)
        
        stdv = 1. / np.sqrt(self.wy.size(1))  
        self.wy.data.uniform_(-stdv, stdv)
        self.by.data.uniform_(-stdv, stdv)
    
    def forward(self, x):
        h = x.mm(self.wh) + self.bh
        a = F.logsigmoid(h)      # activation function for hidden layer
        y = a.mm(self.wy) + self.by
        return y

net = Net(n_feature=1, n_hidden=10, n_output=1).cuda()

The traceback I get is:

Traceback (most recent call last):

  File "<ipython-input-137-44e8bf8308ac>", line 1, in <module>
    net = Net(n_feature=1, n_hidden=10, n_output=1).cuda()     # define the network

  File "~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))

  File "~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 152, in _apply
    param.data = fn(param.data)

  File "~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in <lambda>
    return self._apply(lambda t: t.cuda(device))

  File "~/miniconda3/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)

RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generic/THCTensorCopy.c:20

My guess was that it’s because I don’t use nn or functional to define the layers, but I don’t know. If so, what is a good way to mix hand-written equations with the available layers?

I also tried:

        dtype = torch.cuda.FloatTensor
        self.wh = Parameter(torch.Tensor(n_feature, n_hidden).type(dtype))
        self.bh = Parameter(torch.Tensor(n_hidden).type(dtype))
        ...

But this gives the same error.

richard · January 19, 2018, 3:29pm

You can use a mix of nn and functional to define the layers. Your code doesn’t crash on my machine.
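
For anyone wondering what such a mix can look like, here is a minimal sketch rather than a recommended design. It assumes PyTorch 0.4 or later (for torch.device, torch.empty and .to()), which is newer than the version in this thread. The class name MixedNet and the choice of nn.Linear for the output layer are illustrative, not taken from the original code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedNet(nn.Module):
    """Hidden layer written as an explicit equation, output layer as nn.Linear."""

    def __init__(self, n_feature, n_hidden, n_output):
        super().__init__()
        # Hand-written hidden layer. Wrapping the tensors in nn.Parameter
        # registers them with the module, so .to(device) / .cuda() and
        # optimizers see them just like the weights of built-in layers.
        stdv = 1.0 / math.sqrt(n_hidden)
        self.wh = nn.Parameter(torch.empty(n_feature, n_hidden).uniform_(-stdv, stdv))
        self.bh = nn.Parameter(torch.empty(n_hidden).uniform_(-stdv, stdv))
        # Ready-made output layer (illustrative choice).
        self.out = nn.Linear(n_hidden, n_output)

    def forward(self, x):
        h = x.mm(self.wh) + self.bh   # explicit equation
        a = F.logsigmoid(h)           # functional activation
        return self.out(a)            # built-in layer


# Fall back to the CPU when no GPU is usable, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = MixedNet(n_feature=1, n_hidden=10, n_output=1).to(device)
y = net(torch.randn(5, 1, device=device))
print(y.shape)
```

The four parameters (wh, bh, and the weight and bias of the Linear layer) all show up in net.parameters(), so moving the module to the GPU moves the hand-written ones too.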

It sounds like PyTorch isn’t able to copy memory to your GPU. Could you let me know the following:

which PyTorch version you’re using (torch.__version__)
how you installed PyTorch
which CUDA version you’re using (on the command line, run nvcc --version)
the output of nvidia-smi
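
Most of these details can be gathered in one go. The following is a rough sketch, not an official tool: the helper name collect_env is made up for illustration, and it assumes a PyTorch build that exposes torch.version.cuda. Note that this value is the CUDA version PyTorch was built against, which for conda binaries may differ from what nvcc reports for the system toolkit.

```python
import subprocess

import torch


def collect_env():
    """Gather the requested details into a dict (helper name is illustrative)."""
    info = {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "torch_cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)

    # nvcc and nvidia-smi may not be on PATH; record that instead of failing.
    for name, cmd in (("nvcc", ["nvcc", "--version"]), ("nvidia_smi", ["nvidia-smi"])):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            info[name] = result.stdout.strip()
        except OSError as err:
            info[name] = "could not run {}: {}".format(cmd[0], err)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print("{}:\n{}\n".format(key, value))
```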

gerazov · January 19, 2018, 3:36pm

Ok, that’s awesome!

My torch.__version__ is 0.3.0.post4
I installed using conda (miniconda 3) on linux.
cat /usr/local/cuda/version.txt -> CUDA Version 8.0.61
nvidia-smi :

Fri Jan 19 16:34:41 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M2000        Off  | 0000:03:00.0      On |                  N/A |
| 58%   49C    P0    25W /  75W |   2181MiB /  4036MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1298    G   /opt/viber/Viber                                49MiB |
|    0      1984    G   /usr/lib/xorg/Xorg                               0MiB |
|    0      2742    G   /usr/lib/xorg/Xorg                             282MiB |
|    0     19032    G   /usr/bin/gnome-shell                           225MiB |
|    0     19704    G   /usr/lib/firefox/firefox                         1MiB |
|    0     21829    C   /usr/lib/libreoffice/program/soffice.bin        31MiB |
|    0     26772    G   /usr/lib/firefox/firefox                       162MiB |
|    0     26774    G   /usr/lib/firefox/firefox                        62MiB |
|    0     26780    G   /usr/lib/firefox/firefox                         1MiB |
|    0     29663    G   ~/miniconda3/bin/python                         41MiB |
|    0     29767    C   ~/miniconda3/bin/python                       1313MiB |
+-----------------------------------------------------------------------------+

gerazov · January 19, 2018, 3:39pm

Wait, something’s fishy: pushing variables to CUDA and running PyTorch was working earlier, but now that gives the same error too. In the meantime the computer went to sleep. Maybe I need to restart my system. Give me a minute …

richard · January 19, 2018, 3:41pm

I remember people mentioning before that PyTorch on the GPU doesn’t cope too well with the computer going to sleep. This could be something like that.
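
A quick way to tell a GPU that has stopped responding apart from a bug in the model is to run a tiny CUDA operation on its own. The sketch below is only an assumption about what a useful check looks like; the function name cuda_smoke_test is made up, and it uses the device= keyword from PyTorch 0.4 or later. It cannot say why the GPU is failing, only that even a trivial kernel does not run.

```python
import torch


def cuda_smoke_test():
    """Try a tiny allocation plus kernel on the GPU; return (ok, message)."""
    if not torch.cuda.is_available():
        return False, "CUDA is not available to this process"
    try:
        x = torch.ones(8, device="cuda")
        # .item() copies the result back to the host, which forces the
        # kernel to finish and surfaces any launch failure here.
        total = (x * 2).sum().item()
    except RuntimeError as err:
        return False, "CUDA call failed: {}".format(err)
    if total != 16.0:
        return False, "unexpected result {}".format(total)
    return True, "GPU responded normally"


ok, message = cuda_smoke_test()
print(ok, message)
```

If this fails in a fresh Python process as well, the model code is unlikely to be the cause, and restarting the process or the machine is a reasonable next step.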

gerazov · January 19, 2018, 3:49pm

Exactly: I rebooted and it works like a charm.

Thank you and keep up the good work!
I’ll also keep spreading the word about PyTorch