Hi, I am using PyTorch with smp (segmentation_models_pytorch) for computer vision.
I trained a model using Python 3.6, and training worked. The model even made reasonable predictions.
Training stopped working when I moved to
PyTorch: 2.3.0+cu121
Python 3.10.7
CUDA Version: 12.4
The code works on Windows. The only difference is that on Windows I have Python 3.10.11.
I have two CUDA cards and train on the second one:
environ['CUDA_VISIBLE_DEVICES'] = '1'
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   30C    P8              6W /  180W |    1790MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080        Off |   00000000:00:07.0 Off |                  N/A |
|  0%   31C    P0             38W /  180W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
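For reference, the device selection described above can be sketched as follows. This is a minimal sketch, not the actual training script: `select_gpu` is a hypothetical helper name, and it assumes the variable is set before the first CUDA call (ideally before `import torch`). The import is wrapped in `try`/`except` only so the snippet also runs where PyTorch is not installed.

```python
import os


def select_gpu(index: int) -> str:
    """Restrict this process to one physical GPU (hypothetical helper).

    Assumption: this runs before the first CUDA call in the process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Physical card 1 from the nvidia-smi listing.
select_gpu(1)

try:
    import torch
except ImportError:
    torch = None  # allows the sketch to run without PyTorch

if torch is not None:
    # Report what PyTorch can see after the restriction.
    print("CUDA available:", torch.cuda.is_available())
    print("visible devices:", torch.cuda.device_count())
```

With only one card visible, that card is addressed as `cuda:0` inside the process, whatever its index in nvidia-smi.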
However, training fails with the message below:
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=
Full output below:
INFO:albumentations.check_version:A new version of Albumentations is available: 1.4.8 (you have 1.4.7). Upgrade using: pip install --upgrade albumentations
missed model data/ttv_1_5_w_background
Traceback (most recent call last):
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 236, in <module>
    best_model_filenames[dt] = train(data_root, dt, 0)
  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 188, in train
    train_epoch = TrainValidateEpochWithAUX(
  File "/home/stepan/acne_seg_models_dev/epoch_with_gm.py", line 12, in __init__
    super().__init__(
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 75, in __init__
    super().__init__(
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 16, in __init__
    self._to_device()
  File "/home/stepan/.local/lib/python3.10/site-packages/segmentation_models_pytorch/utils/train.py", line 19, in _to_device
    self.model.to(self.device)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=
CUDA call was originally invoked at:
  File "/home/stepan/acne_seg_models_dev/train_runner_resnet18_normalized.py", line 7, in <module>
    from torch.utils.data import DataLoader
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/home/stepan/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
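
The assertion in the error, `device >= 0 && device < num_gpus`, can be mirrored in plain Python as a pre-flight check before `model.to(device)`. This is only a sketch for narrowing the problem down, not a fix: `check_device_index` is a hypothetical helper, and in practice `num_gpus` would presumably come from `torch.cuda.device_count()`.

```python
def check_device_index(device_index: int, num_gpus: int) -> None:
    """Raise a readable error if device_index is outside [0, num_gpus).

    Hypothetical helper that mirrors the condition in the failed assert.
    """
    if not (0 <= device_index < num_gpus):
        raise ValueError(
            f"device index {device_index} is not valid "
            f"for {num_gpus} visible GPU(s)"
        )


# Assumption: one visible card, as with CUDA_VISIBLE_DEVICES set to '1'.
check_device_index(0, num_gpus=1)  # index 0 is in range, so nothing is raised

try:
    check_device_index(1, num_gpus=1)  # index 1 is out of range here
except ValueError as err:
    print(err)
```

If a check like this fails with the real values, the device index passed to the training code and the set of visible cards disagree.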