PyTorch 2.3.0+cu121 fails to see CUDA graphics card

Sam_Tux · May 31, 2024, 9:20am

Hi, I am using PyTorch with smp (segmentation_models_pytorch) for computer vision.
I trained a model using Python 3.6 and training worked. The model even made reasonable predictions.

Training stopped working after I moved to:
PyTorch: 2.3.0+cu121
Python 3.10.7
CUDA Version: 12.4

The code works on Windows. The only difference is that on Windows I have Python 3.10.11.

I have two CUDA cards and train on the second one:
environ['CUDA_VISIBLE_DEVICES'] = '1'

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   30C    P8              6W /  180W |    1790MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080        Off |   00000000:00:07.0 Off |                  N/A |
|  0%   31C    P0             38W /  180W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
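
For reference, `CUDA_VISIBLE_DEVICES` renumbers the devices a process can see: with the value `1`, the second physical card becomes logical device 0, and it is the only device visible. The sketch below illustrates that mapping in plain Python. The helper name `logical_to_physical` is hypothetical, and it assumes the variable holds only integer indices (the real variable also accepts GPU UUIDs, which this sketch does not handle). The variable has to be in place before CUDA is initialized in the process, so setting it before `import torch` is the safest order.

```python
import os


def logical_to_physical(visible, physical_count):
    """Map logical CUDA indices to physical ones for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration; handles integer indices only.
    """
    if visible is None:
        # Variable unset: every physical device is visible, in order.
        return list(range(physical_count))
    mapping = []
    for token in visible.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # CUDA ignores an invalid entry and everything after it.
            break
        mapping.append(int(token))
    return mapping


# Set the variable before importing torch so it is in place before CUDA starts.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"], physical_count=2)
print(mapping)  # [1]: logical device 0 is physical GPU 1; there is no logical device 1
```

Under this setting a script would address the card as `cuda:0` (or plain `cuda`); a device index of 1 would be out of range for the process.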

However, training fails with the message below:

torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

Full output below:
INFO:albumentations.check_version:A new version of Albumentations is available: 1.4.8 (you have 1.4.7). Upgrade using: pip install --upgrade albumentations
missed model data/ttv_1_5_w_background
Traceback (most recent call last):
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 236, in <module>
    best_model_filenames[dt] = train(data_root, dt, 0)
  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 188, in train
    train_epoch = TrainValidateEpochWithAUX(
  File "/home/stepan/acne_seg_models_dev/epoch_with_gm.py", line 12, in __init__
    super().__init__(
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 75, in __init__
    super().__init__(
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 16, in __init__
    self._to_device()
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 19, in _to_device
    self.model.to(self.device)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:

  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 7, in <module>
    from torch.utils.data import DataLoader
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

ptrblck · May 31, 2024, 12:48pm

It seems your system might have issues communicating with your GPU. You could try compiling and running any other CUDA application to verify if you are still seeing the same issue. If so, you might need to reinstall your NVIDIA drivers.

Sam_Tux · May 31, 2024, 2:05pm

I will try. Though it is strange that:
(1) The code works with earlier versions of Python and Torch (on the same system)
(2) import torch
print(torch.cuda.is_available())  # True
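
One way to narrow this down: to my understanding, `torch.cuda.is_available()` only checks that at least one device is reported and does not run PyTorch's deferred CUDA initialization, so it can print `True` while the first real CUDA call still fails. The following is a minimal sketch, not a definitive diagnostic, that forces initialization and touches every visible device. The function name `cuda_report` is made up for this example, and it returns early if `torch` is not installed.

```python
import os


def cuda_report():
    """Collect basic CUDA facts and force lazy initialization (hypothetical helper)."""
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["built_with_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["is_available"] = torch.cuda.is_available()
    if not report["is_available"]:
        return report

    try:
        torch.cuda.init()  # runs the deferred initialization calls
        count = torch.cuda.device_count()
        report["device_count"] = count
        report["devices"] = []
        for i in range(count):
            name = torch.cuda.get_device_name(i)
            capability = torch.cuda.get_device_capability(i)
            # Allocate a tiny tensor to confirm the device is actually usable.
            torch.zeros(1, device=f"cuda:{i}")
            report["devices"].append((i, name, capability))
    except Exception as exc:  # e.g. torch.cuda.DeferredCudaCallError
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running it in the same environment and with the same `CUDA_VISIBLE_DEVICES` value as the training script should show whether the failure reproduces outside the training code, and how many devices the process actually sees.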