RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED

Hi,
I got the following error in my Docker container.
It says CUDA is available, but it couldn't get the current device.
I am able to run nvidia-smi in the container, but even when I pass the GPU ID via CUDA_VISIBLE_DEVICES, the application cannot see the device.
Could you give me any guidance on solving this problem?
Thanks!
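For anyone hitting the same symptom: the assert fires when PyTorch is asked about a device index that falls outside the set of GPUs the process can actually see, so sanity-checking CUDA_VISIBLE_DEVICES before importing torch can narrow things down. A minimal sketch of such a check (this is a hypothetical helper, not part of PyTorch; it only approximates how the CUDA runtime parses the variable):

```python
import os

def visible_gpu_indices():
    """Roughly parse CUDA_VISIBLE_DEVICES: a comma-separated list of
    device indices; unset means 'all devices are visible'."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # all devices visible
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        # Non-integer entries (a GPU UUID, or a typo) are possible; the
        # runtime silently ignores everything from the first invalid
        # entry onward, so this sketch flags them instead.
        if not token.lstrip("-").isdigit():
            raise ValueError(f"non-integer entry in CUDA_VISIBLE_DEVICES: {token!r}")
        indices.append(int(token))
    return indices

if __name__ == "__main__":
    print(visible_gpu_indices())
```

If this prints an empty list or raises, the process sees no valid GPU indices, which would produce exactly the `device >= 0 && device < num_gpus` assert once torch.cuda initializes.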

root@XXXX:/project# python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at ".../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 552, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at ".../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

CUDA call was originally invoked at:

  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 798, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 179, in <module>
    _lazy_call(_check_capability)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 177, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

Are you able to run any other CUDA application in your docker container?
Did you make sure to specify the GPUs to use, e.g. via docker run --gpus all?
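For reference, a minimal sketch of how GPUs are usually exposed to a container, assuming the NVIDIA Container Toolkit is installed on the host (the image tag is just an example):

```shell
# Expose all host GPUs to the container and check they are visible
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi

# Or expose a single device; note that inside the container it will
# then appear as device 0, regardless of its index on the host
docker run --rm --gpus '"device=4"' nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
```

That renumbering is a common source of this assert: passing the host index (e.g. 4) to CUDA_VISIBLE_DEVICES inside a container that only sees one device.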

I am on a cluster with A100 GPUs and am facing the same issue. Do you have a solution for this?

I’m hitting the same error on an A100 machine. Did you find a way around it?

What's confusing about the error is that there doesn't even seem to be a property called num_gpus in the Python API; it's device_count. Weird. It clearly suggests a software error of some kind (maybe a deprecated version of something somewhere).

Might be a different cause, but I thought I'd add this in case it helps. I'm not using conda and hit the error in a slightly different way.

I got this error on an AWS P4 instance with 8 A100s this morning, in a Jupyter notebook. When I set the environment variables before importing torch, it worked:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4" 
os.environ["WORLD_SIZE"] = "1"
import torch

Separately, when I ran the exact thing you ran in an interactive session, I got no error:

Python 3.8.12 (default, Nov  2 2021, 13:56:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.1+cu118'
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> exit()

Sorry for the late reply.
I was not able to run other CUDA applications either, so I reinstalled all the drivers and container libraries.
I think there was a mismatch between the container suite and the drivers.
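In case it helps anyone who ends up in the same state, a rough sketch of checking for a driver/container-toolkit mismatch. This assumes Ubuntu with apt and the package names from NVIDIA's install docs; adjust for your distro:

```shell
# On the host: confirm the driver loads and note the CUDA version it supports
nvidia-smi

# Reinstall the container toolkit and restart Docker so it picks it up
sudo apt-get install --reinstall -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify GPUs are reachable from a container again (image tag is an example)
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
```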

It works for me! That’s genius! Thank you soooo much!

Excellent, this works for me too! Thank you for this; it helped me solve an issue that had been troubling me for several days!

I hate myself. It worked for me, too. Mucho thanks @pretzel!