RuntimeError: cuda runtime error (999) : unknown error at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THC/THCGeneral.cpp:47

Why does this error sometimes happen while my code is running, why does CUDA stay unavailable even after I kill the Jupyter notebook, and how can I fix it without having to restart my machine?

(base) mona@mona:~/research/facial_landmark$ nvidia-smi
Thu Oct  8 01:41:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8    18W /  N/A |    901MiB /  7982MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1306      G   /usr/lib/xorg/Xorg                569MiB |
|    0   N/A  N/A      1743      G   /usr/bin/gnome-shell              303MiB |
|    0   N/A  N/A      3069      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      3273      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      3359      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      3844      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      3944      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      4148      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      4222      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A     65964      G   /usr/lib/firefox/firefox            2MiB |
+-----------------------------------------------------------------------------+
(base) mona@mona:~/research/facial_landmark$ python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-d7c246e4a17d> in <module>
      1 #torch.autograd.set_detect_anomaly(True)
      2 network = Network()
----> 3 network.cuda()
      4 
      5 criterion = nn.MSELoss()

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in cuda(self, device)
    456             Module: self
    457         """
--> 458         return self._apply(lambda t: t.cuda(device))
    459 
    460     def cpu(self: T) -> T:

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    352     def _apply(self, fn):
    353         for module in self.children():
--> 354             module._apply(fn)
    355 
    356         def compute_should_use_set_data(tensor, tensor_applied):

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    352     def _apply(self, fn):
    353         for module in self.children():
--> 354             module._apply(fn)
    355 
    356         def compute_should_use_set_data(tensor, tensor_applied):

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    374                 # `with torch.no_grad():`
    375                 with torch.no_grad():
--> 376                     param_applied = fn(param)
    377                 should_use_set_data = compute_should_use_set_data(param, param_applied)
    378                 if should_use_set_data:

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in <lambda>(t)
    456             Module: self
    457         """
--> 458         return self._apply(lambda t: t.cuda(device))
    459 
    460     def cpu(self: T) -> T:

~/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
    188             raise AssertionError(
    189                 "libcudart functions unavailable. It looks like you have a broken build?")
--> 190         torch._C._cuda_init()
    191         # Some of the queued calls may reentrantly call _lazy_init();
    192         # we need to just return without initializing in that case.

RuntimeError: cuda runtime error (999) : unknown error at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THC/THCGeneral.cpp:47

You could try reloading the NVIDIA kernel module (nvidia_uvm) instead of restarting the machine.
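For example (just a sketch; it assumes nvidia_uvm is the module that got stuck, and it has to be run on the host while no process is still holding the GPU):

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

After that, torch.cuda.is_available() should return True again in a fresh Python process.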

I have a similar problem.

root@4a2115485a52:/host/mlbench/pytorch# python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> quit()
root@4a2115485a52:/host/mlbench/pytorch# nvidia-smi
Mon Aug 30 02:14:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0  On |                  N/A |
| 48%   37C    P8     6W /  75W |    119MiB /  5055MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

But I am in a Docker container, and modprobe is not installed inside it.
Running the commands on the host gives errors:

peter@mouse:~/ml/mlbench$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
peter@mouse:~/ml/mlbench$ sudo modprobe nvidia_uvm
modprobe: FATAL: Module nvidia_uvm not found in directory /lib/modules/5.11.0-27-generic
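As a side note, the "Module nvidia_uvm not found" message usually means no NVIDIA modules are present for the running kernel (5.11.0-27-generic here), for example after a kernel update. A quick check on the host (a sketch; assumes the driver was installed via DKMS or the Ubuntu packages):

dkms status
find /lib/modules/$(uname -r) -name 'nvidia*'

If the second command prints nothing, the driver needs to be reinstalled or rebuilt for the current kernel.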

NVIDIA driver, CUDA, and cuDNN versions inside the container:

cuda path:/usr/local/cuda
NVDRV:450.119.03,CUDA:10.2,cuDNN:7.6.5.32-1

CUDA is not installed on the host.

Restarting the host machine (with Ubuntu 20.04) does not solve the problem.

Upgrading the NVIDIA driver from 450 to 460 fixed the problem for me.
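In case it helps someone else, on Ubuntu 20.04 the upgrade can be done roughly like this (a sketch; the exact package name is an assumption and depends on what your repositories offer):

ubuntu-drivers devices                # list the GPU and the recommended driver packages
sudo apt install nvidia-driver-460    # install the 460 series driver
sudo reboot

Then verify with nvidia-smi after the reboot.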