[SOLVED] CUDA runtime error (11) - RTX 208


I have an experiment I’ve been developing for some time now, and on my old GPU server (Titan X card) it works just fine. Recently I started using a new workstation with a GeForce RTX 208, and when I run my code I get the following error:

    outputs = model(input_vars)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 148, in forward
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/_jit_internal.py", line 132, in fn
    return if_false(*args, **kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 425, in _max_pool2d
    input, kernel_size, stride, padding, dilation, ceil_mode)[0]
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 417, in max_pool2d_with_indices
    return torch._C._nn.max_pool2d_with_indices(input, kernel_size, _stride, padding, dilation, ceil_mode)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THCUNN/generic/SpatialDilatedMaxPooling.cu:120

I’ve pasted only the relevant parts of the stack trace here.
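While debugging, I sketched the per-dimension output-size formula that max_pool2d uses (same formula as in the PyTorch docs), to rule out pooling arguments that produce a non-positive output size — one possible way to end up with an invalid call like the one in the trace. `pool_output_size` is my own helper, not a PyTorch function:

```python
import math

def pool_output_size(size, kernel, stride=None, padding=0, dilation=1,
                     ceil_mode=False):
    """Per-dimension output size of max_pool2d; a non-positive
    result means the pooling call is invalid for that input."""
    if stride is None:
        stride = kernel  # max_pool2d defaults stride to kernel_size
    numer = size + 2 * padding - dilation * (kernel - 1) - 1
    rounder = math.ceil if ceil_mode else math.floor
    return rounder(numer / stride) + 1

# A 2x2 pool on a 32-pixel dimension gives 16; a 7-wide kernel on a
# 5-pixel input gives 0, i.e. an invalid configuration.
print(pool_output_size(32, 2))  # 16
print(pool_output_size(5, 7))   # 0
```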

This is my nvidia-smi output:

    | NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
    |  0%   43C    P8    13W / 260W |    139MiB / 10986MiB |      0%      Default |

    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |    0      1332      G   /usr/lib/xorg/Xorg                            59MiB |
    |    0      1377      G   /usr/bin/gnome-shell                          78MiB |

I’m pretty baffled by this and have no idea where to start. Any suggestions would be greatly appreciated. I should mention that on this machine CUDA runs inside the following Docker environment:

    docker run --runtime=nvidia nvidia/cuda:10.0-base
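One quick sanity check in that setup (a command I’d run to narrow things down, not part of my original workflow) is confirming the container actually sees the card and reports the expected CUDA version:

```shell
# Run nvidia-smi inside the same base image; it should print the
# driver and CUDA version visible to containers.
docker run --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi
```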

OK, so I cleaned up my code a bit and it seems to work fine now.
I had a randomized process that relied on catching many PyTorch exceptions until the operation I wanted succeeded. I changed it to something cleaner that doesn’t rely on exceptions. I’ll update this thread if the error appears again during the run.
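For anyone hitting the same thing: the cleanup amounted to validating sampled parameters up front instead of catching CUDA runtime errors after the fact. A hypothetical sketch of the pattern (the helper name, the pooling layout, and the parameters are made up for illustration, not my actual code):

```python
import random

def sample_valid_input_size(lo, hi, kernel=2, stride=2, num_pools=4,
                            seed=None):
    """Keep sampling a spatial size until it survives `num_pools`
    max-pool stages with a positive output size -- a plain arithmetic
    check, so no exception handling around the forward pass is needed."""
    rng = random.Random(seed)
    while True:
        size = rng.randint(lo, hi)
        s = size
        for _ in range(num_pools):
            s = (s - kernel) // stride + 1  # floored output-size formula
            if s <= 0:
                break  # invalid size: resample instead of letting CUDA fail
        else:
            return size
```

The design point is simply that the output-size arithmetic is cheap and deterministic, so it is better to reject bad samples before the forward pass than to catch a CUDA runtime error afterwards.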