RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED

Hi,
I got the following error in my Docker container.
PyTorch reports that CUDA is available, but it fails to get the current device.
I can run nvidia-smi in the container, but even when I pass the GPU ID via CUDA_VISIBLE_DEVICES, the application cannot find the device.
Could you give me any guidance on solving this problem?
Thanks!

root@XXXX:/project# python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 552, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

CUDA call was originally invoked at:

['  File "<stdin>", line 1, in <module>\n', '  File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 843, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', '  File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 798, in <module>\n    _C._initExtension(manager_path())\n', '  File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 843, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', '  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 179, in <module>\n    _lazy_call(_check_capability)\n', '  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 177, in _lazy_call\n    _queued_calls.append((callable, traceback.format_stack()))\n']

Are you able to run any other CUDA application in your docker container?
Did you make sure to expose the GPUs to the container, e.g. via docker run --gpus all?
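
One way to check this from inside the container is a short script like the one below. It is only a sketch: it assumes nvidia-smi is on the PATH and relies on its -L flag, which lists the GPUs the driver can see. The helper name list_visible_gpus is made up for illustration.

```python
import shutil
import subprocess


def list_visible_gpus():
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        # No driver utilities in this container: the GPUs were probably not exposed.
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    # Keep only the top-level "GPU <index>: ..." lines.
    lines = (line.strip() for line in result.stdout.splitlines())
    return [line for line in lines if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    print(f"{len(gpus)} GPU(s) visible to the container")
    for gpu in gpus:
        print(" ", gpu)
```

If this prints 0 GPUs, the problem is likely in how the container was started rather than in PyTorch.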

I am on a cluster with A100 GPUs, and I am facing the same issue. Do you have any solutions that can resolve this?

I’m hitting the same error on an A100 machine. Did you find a way around it?

What’s confusing about the error is that there doesn’t seem to be any property called num_gpus in the Python API; the function is torch.cuda.device_count(). Weird. That seems to suggest a software error of some kind (maybe an outdated version of something somewhere).
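
As a rough illustration of what the message is complaining about (this is not PyTorch's code, and device_in_range is a made-up name), the condition in the error text amounts to a bounds check on the device index:

```python
def device_in_range(device: int, num_gpus: int) -> bool:
    """Mirror the condition from the error message: device >= 0 && device < num_gpus."""
    return 0 <= device < num_gpus


# If no devices are counted, even index 0 fails the check.
print(device_in_range(0, 0))  # False
print(device_in_range(0, 8))  # True
print(device_in_range(8, 8))  # False
```

So the assertion would fail whenever the number of devices counted at that point is smaller than the index being queried, for example zero, which would be consistent with torch.cuda.is_available() returning True while current_device() fails.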

Might be different, but thought I’d add in case it helps. I’m not using conda and hit the error somewhat differently.

I got this error on an AWS P4 instance with 8 A100s this morning in a Jupyter notebook. When I set the environment variables before importing torch, it worked:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4" 
os.environ["WORLD_SIZE"] = "1"
import torch
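
A possible reason the ordering matters is that CUDA reads CUDA_VISIBLE_DEVICES when it is initialized, so changing it later in the same process may have no effect. Below is a minimal sketch of a guard around that pattern. pin_gpu is a hypothetical helper name, and "4" is just the index from the snippet above.

```python
import os
import sys
import warnings


def pin_gpu(gpu_ids: str) -> None:
    """Set CUDA_VISIBLE_DEVICES, warning if torch was already imported."""
    if "torch" in sys.modules:
        warnings.warn(
            "torch is already imported; CUDA_VISIBLE_DEVICES may be ignored "
            "if CUDA has already been initialized in this process."
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids


pin_gpu("4")  # call this before `import torch`
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

In a notebook this would go in the first cell, and the kernel probably needs a restart if torch was imported earlier in the session.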

Separately, when I ran exactly what you ran in an interactive session, I got no error:

Python 3.8.12 (default, Nov  2 2021, 13:56:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.1+cu118'
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> exit()

Sorry for the late reply,
I was not able to run other CUDA applications, so I reinstalled all the drivers and container libraries.
I think there was some mismatch between the container suites and the drivers.
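
To look for this kind of mismatch, it can help to print the versions side by side. The sketch below assumes nvidia-smi is on the PATH and that torch may or may not be installed. collect_versions is a hypothetical helper, and anything it cannot find is reported as None instead of raising.

```python
import shutil
import subprocess


def collect_versions():
    """Gather the driver version and the PyTorch/CUDA build versions."""
    info = {"driver": None, "torch": None, "torch_cuda": None}
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        fields = out.stdout.split()
        if out.returncode == 0 and fields:
            info["driver"] = fields[0]  # one entry per GPU; take the first
    try:
        import torch
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return info


print(collect_versions())
```

Running it both on the host and inside the container shows whether the two see the same driver.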

It works for me! That’s genius! Thank you soooo much!

Excellent! This works for me! Thank you for this, because it helped me solve an issue that had been troubling me for several days!

I hate myself. It worked for me, too. Mucho thanks @pretzel!

This is genius… It is working amazingly…