Some observations on "cuda runtime error (30)"

KFrank · August 6, 2019, 4:16pm

Hello Forum!

I have some information about the behavior of “cuda runtime
error (30)” (probably somewhat specific to my particular
configuration).

This is a follow-on to a number of threads about “error 30,”
and, in particular, to this post:

Clued in by Andrei’s observation that torch.cuda.is_available()
“breaks” cuda, I find (for me) that if torch.cuda.is_available()
is the first cuda call, subsequent cuda calls will throw “error 30”
unless the next cuda call is made promptly (within about one second).

Specifically, start a fresh python session and import torch, then:

- don’t call torch.cuda.is_available()
  - subsequent cuda calls work

- call torch.cuda.is_available() first
  - make another cuda call in less than (about) a second
  - it and subsequent cuda calls work

- call torch.cuda.is_available() first
  - wait about two seconds
  - subsequent cuda calls throw “error 30” with stack trace A
    (except torch.cuda.device_count() works in this case)

- call torch.cuda.is_available() first
  - wait more than (about) ten seconds
  - subsequent cuda calls throw “error 30” with stack trace B

(See example scripts, below.)
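The cases above differ only in whether torch.cuda.is_available() is called and how long the pause after it is, so they can be generated from one template and run in a fresh interpreter each time. What follows is a minimal sketch, not the procedure used for the results in this post: the names PROBE_TEMPLATE, make_probe_script and run_fresh are hypothetical, and the try/except loop is added only to keep the output short. The original tests were typed into an interactive session, and it is an assumption that running the same calls via python -c behaves the same way.

```python
import subprocess
import sys

# Template for one probe run. The torch.cuda calls are the ones used in the
# post; the reporting loop around them is an addition for compact output.
PROBE_TEMPLATE = """\
from time import sleep
import torch
print(torch.__version__)
{first_call}
sleep({delay!r})
for name, call in [
    ("current_device", lambda: torch.cuda.current_device()),
    ("device_count", lambda: torch.cuda.device_count()),
    ("get_device_name", lambda: torch.cuda.get_device_name(0)),
    ("get_device_capability", lambda: torch.cuda.get_device_capability(0)),
]:
    try:
        print(name, "->", call())
    except Exception as exc:  # keep going so every call gets reported
        print(name, "FAILED:", type(exc).__name__, exc)
"""


def make_probe_script(delay, call_is_available=True):
    """Build the source of one probe script (hypothetical helper).

    delay is the pause in seconds placed after the optional
    torch.cuda.is_available() call.
    """
    if call_is_available:
        first_call = 'print("is_available ->", torch.cuda.is_available())'
    else:
        first_call = "pass  # skip the availability check"
    return PROBE_TEMPLATE.format(first_call=first_call, delay=float(delay))


def run_fresh(script, timeout=120.0):
    """Run script in a new interpreter and return stdout and stderr combined.

    Uses only subprocess arguments that exist on Python 3.6.
    """
    result = subprocess.run(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.stdout
```

For example, looping over delays of 0.0, 1.0, 2.0 and 10.0 and printing run_fresh(make_probe_script(delay)) would correspond roughly to the no-wait, short-wait, medium-wait and long-wait scripts shown further down.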

This seems to be repeatable for me, although, because it depends on
timing and on the order in which calls are made, I don’t know whether
it will reproduce every time.
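One way to check repeatability would be to capture the output of many fresh runs and tally which outcome each one shows. The sketch below is a rough illustration under stated assumptions: classify_output and tally are hypothetical helper names, and the classification relies only on markers visible in the traces further down (stack trace B mentions THCGeneral.c, while stack trace A fails only in Module.cpp). Other configurations may print different text.

```python
from collections import Counter


def classify_output(text):
    """Label the captured output of one run as "ok", "A" or "B".

    Trace B also contains Module.cpp lines, so THCGeneral.c is checked first.
    """
    if "error (30)" not in text:
        return "ok"
    if "THCGeneral.c" in text:
        return "B"
    return "A"


def tally(outputs):
    """Count how often each outcome occurs across several captured runs."""
    return Counter(classify_output(text) for text in outputs)
```

Feeding tally the outputs of, say, ten runs per wait time would show whether a given wait always lands in the same category.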

I am using pytorch 0.3.0 for cuda 8, specifically
pytorch-0.3.0-py36_0.3.0cu80.tar.bz2 from the legacy-builds link
in this post of Peter’s:

I am using a mobile (laptop) Quadro K1100M gpu with the nvidia driver
version 426.00 running on windows 10. (As a side note, I initially
saw this “error 30” issue using the nvidia driver version 425.45.
I would guess these observations also apply to that driver, but I don’t
know for sure because I upgraded the driver before I came across this
business with torch.cuda.is_available() “breaking” cuda and before I
ran these tests.)

Anyway, I have no idea what is going on here. I’m posting these
results in the hope they might shed some light on the “error 30”
issue, and be helpful to others. I suspect that these results
are specific to configurations similar to mine, and are not likely
to be relevant to all instances of “error 30”.
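Going only by the short-wait observation (a cuda call made promptly after torch.cuda.is_available() succeeds, and later calls then work), one thing to try is to pair the availability check with an immediate follow-up call. This is a sketch under that assumption, not a confirmed fix: cuda_ready is a hypothetical helper name, and whether it avoids “error 30” on any given configuration is untested.

```python
def cuda_ready(cuda):
    """Return True if cuda reports as available, touching the device at once.

    cuda is expected to be the torch.cuda module; it is passed in so that
    this helper does not import torch itself.
    """
    if not cuda.is_available():
        return False
    # Follow-up cuda call made immediately, with no other work in between.
    cuda.current_device()
    return True
```

It would be called as cuda_ready(torch.cuda) right after import torch, before any other work that could introduce a delay.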

Thanks.

K. Frank

Here are some illustrative scripts and their outputs. Each was
run in a fresh python session started from the command line. (The
sleep (10.0) in the first script is just to show that it is the wait
after torch.cuda.is_available() that matters, rather than,
for example, a wait after import torch.)

No wait – cuda works:

from time import sleep
import torch
print (torch.__version__)
sleep (10.0)
torch.cuda.is_available()
torch.cuda.current_device()
torch.cuda.device_count()
torch.cuda.get_device_name (0)
torch.cuda.get_device_capability (0)
quit()

No-wait result:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from time import sleep
>>> import torch
>>> print (torch.__version__)
0.3.0b0+591e73e
>>> sleep (10.0)
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name (0)
'Quadro K1100M'
>>> torch.cuda.get_device_capability (0)
(3, 0)
>>> quit()

Short wait – cuda works:

from time import sleep
import torch
print (torch.__version__)
torch.cuda.is_available()
sleep (1.0)
torch.cuda.current_device()
torch.cuda.device_count()
torch.cuda.get_device_name (0)
torch.cuda.get_device_capability (0)
quit()

Short-wait result:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from time import sleep
>>> import torch
>>> print (torch.__version__)
0.3.0b0+591e73e
>>> torch.cuda.is_available()
True
>>> sleep (1.0)
>>> torch.cuda.current_device()
0
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name (0)
'Quadro K1100M'
>>> torch.cuda.get_device_capability (0)
(3, 0)
>>> quit()

Medium wait – cuda fails (A):

from time import sleep
import torch
print (torch.__version__)
torch.cuda.is_available()
sleep (2.0)
torch.cuda.current_device()
torch.cuda.device_count()
torch.cuda.get_device_name (0)
torch.cuda.get_device_capability (0)
quit()

Medium-wait result:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from time import sleep
>>> import torch
>>> print (torch.__version__)
0.3.0b0+591e73e
>>> torch.cuda.is_available()
True
>>> sleep (2.0)
>>> torch.cuda.current_device()
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=143 error=30 : unknown error
Traceback (most recent call last):
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 151, in _lazy_init
    queued_call()
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 103, in _check_capability
    major = get_device_capability(d)[0]
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 266, in get_device_capability
    return torch._C._cuda_getDeviceCapability(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:143

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 302, in current_device
    _lazy_init()
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 155, in _lazy_init
    raise_from(DeferredCudaCallError(msg), e)
  File "<string>", line 3, in raise_from
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:143

CUDA call was originally invoked at:

['  File "<stdin>", line 1, in <module>\n', '  File "<frozen importlib._bootstrap>", line 971, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 678, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', '  File "C:\\<path_to_miniconda>\\Miniconda3\\lib\\site-packages\\torch\\__init__.py", line 328, in <module>\n    import torch.cuda\n', '  File "<frozen importlib._bootstrap>", line 971, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 678, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', '  File "C:\\<path_to_miniconda>\\Miniconda3\\lib\\site-packages\\torch\\cuda\\__init__.py", line 118, in <module>\n    _lazy_call(_check_capability)\n', '  File "C:\\<path_to_miniconda>\\Miniconda3\\lib\\site-packages\\torch\\cuda\\__init__.py", line 116, in _lazy_call\n    _queued_calls.append((callable, traceback.format_stack()))\n']
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name (0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 253, in get_device_name
    return torch._C._cuda_getDeviceName(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:131
>>> torch.cuda.get_device_capability (0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 266, in get_device_capability
    return torch._C._cuda_getDeviceCapability(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:143
>>> quit()

Long wait – cuda fails (B):

from time import sleep
import torch
print (torch.__version__)
torch.cuda.is_available()
sleep (10.0)
torch.cuda.current_device()
torch.cuda.device_count()
torch.cuda.get_device_name (0)
torch.cuda.get_device_capability (0)
quit()

Long-wait result:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from time import sleep
>>> import torch
>>> print (torch.__version__)
0.3.0b0+591e73e
>>> torch.cuda.is_available()
True
>>> sleep (10.0)
>>> torch.cuda.current_device()
THCudaCheck FAIL file=D:\pytorch\pytorch\torch\lib\THC\THCGeneral.c line=120 error=30 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 302, in current_device
    _lazy_init()
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 140, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at D:\pytorch\pytorch\torch\lib\THC\THCGeneral.c:120
>>> torch.cuda.device_count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 294, in device_count
    _lazy_init()
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 140, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at D:\pytorch\pytorch\torch\lib\THC\THCGeneral.c:120
>>> torch.cuda.get_device_name (0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 253, in get_device_name
    return torch._C._cuda_getDeviceName(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:131
>>> torch.cuda.get_device_capability (0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_to_miniconda>\Miniconda3\lib\site-packages\torch\cuda\__init__.py", line 266, in get_device_capability
    return torch._C._cuda_getDeviceCapability(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:143
>>> quit()