CUDA error, param.data.cuda()/module.to('cuda') error

Please help me resolve the error below. My PyTorch version is '2.3.0+cu118' and the CUDA version on the DGX server is 11.5 (from nvcc --version).
I am getting the following error:

Traceback (most recent call last):
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aryan/FSCIL/FaceKD/train_test.py", line 397, in <module>
    main(config)
  File "/home/aryan/FSCIL/FaceKD/train_test.py", line 66, in main
    base = BasePatchKD(config, loaders)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/FSCIL/FaceKD/pkd/core/base_patch_kd.py", line 54, in __init__
    self._init_model()
  File "/home/aryan/FSCIL/FaceKD/pkd/core/base_patch_kd.py", line 74, in _init_model
    param.data = param.data.cuda()
                 ^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:

  File "/home/aryan/FSCIL/FaceKD/train_test.py", line 5, in <module>
    from pkd.utils import set_random_seed, time_now
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/aryan/FSCIL/FaceKD/pkd/__init__.py", line 3, in <module>
    from pkd import core, data_loader, models, evaluation, utils, visualization, losses, operation
  File "<frozen importlib._bootstrap>", line 1415, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/aryan/FSCIL/FaceKD/pkd/core/__init__.py", line 3, in <module>
    from .lr_schedulers import WarmupMultiStepLR
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/aryan/FSCIL/FaceKD/pkd/core/lr_schedulers.py", line 1, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/home/aryan/miniconda3/envs/facekd_new/lib/python3.12/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

import torch and torch.cuda.is_available() are working fine
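
Concretely, checks along these lines succeed (the device_count() line is an extra sanity check, not something reported above):

import torch

print(torch.__version__)          # '2.3.0+cu118'
print(torch.cuda.is_available())  # True
print(torch.cuda.device_count())  # number of GPUs visible to PyTorch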

Based on the error message it seems your setup has issues using your GPU, and I would assume even calling torch.randn(1).cuda() will fail. If so, make sure your setup is able to use the GPU by running, e.g., any CUDA sample.

@ptrblck thanks for the reply. torch.randn(1).cuda() didn't fail; it gave the output tensor([0.0166], device='cuda:0'). Could you guide me further?

Is the code still failing after the smoke test of creating a tensor on the device? If so, do you still see the device in nvidia-smi?

I am able to create a random tensor on CUDA, but how do I check whether it's visible on that device in nvidia-smi?
Below is the smoke-test script:

import torch

def smoke_test():
    if torch.cuda.is_available():
        device = torch.device('cuda')
        try:
            # Attempt to create a tensor on the GPU
            tensor = torch.tensor([1, 2, 3], device=device)
            print(f"Tensor on device {device}: {tensor}")
            return True
        except Exception as e:
            print(f"Failed to create tensor on device {device}: {e}")
            return False
    else:
        print("CUDA is not available.")
        return False

success = smoke_test()
print(f"Smoke test successful: {success}")

It gave the following output:

Tensor on device cuda: tensor([1, 2, 3], device='cuda:0')
Smoke test successful: True
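
To actually see that allocation from nvidia-smi, the allocating process has to stay alive while nvidia-smi runs in another terminal; a short-lived script will already have exited and released its memory by the time you look. A minimal sketch (the tensor size and sleep duration are arbitrary choices, not from the thread):

import os
import time
import torch

device = torch.device('cuda')
# allocate ~400 MiB so the usage stands out in nvidia-smi
x = torch.empty(100 * 1024 * 1024, dtype=torch.float32, device=device)
print(f"PID {os.getpid()}: allocated "
      f"{torch.cuda.memory_allocated(device) / 1024**2:.0f} MiB on {device}")
# keep the process alive; run `nvidia-smi` in another terminal and look
# for this PID in the Processes table
time.sleep(60)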

Running nvidia-smi gives the following output:

Thu Jun 20 17:07:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0              68W / 400W |  66332MiB / 81920MiB |     33%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              85W / 400W |  36568MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:47:00.0 Off |                    0 |
| N/A   26C    P0              66W / 400W |  16564MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4E:00.0 Off |                    0 |
| N/A   26C    P0              67W / 400W |  16001MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:87:00.0 Off |                    0 |
| N/A   59C    P0             334W / 400W |  56080MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:90:00.0 Off |                    0 |
| N/A   34C    P0              71W / 400W |  16558MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   52C    P0             249W / 400W |  52900MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0              67W / 400W |  36420MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2647657      C   python                                    64482MiB |
|    0   N/A  N/A   3531806      C   ...iniconda3/envs/python311/bin/python      554MiB |
|    0   N/A  N/A   3531807      C   ...iniconda3/envs/python311/bin/python      554MiB |
|    0   N/A  N/A   3531808      C   ...iniconda3/envs/python311/bin/python      554MiB |
|    1   N/A  N/A    550956      C   ...n/miniconda3/envs/alice/bin/python3    20564MiB |
|    1   N/A  N/A   2647657      C   python                                    15830MiB |
|    2   N/A  N/A    225618      C   ...n/miniconda3/envs/alice/bin/python3      554MiB |
|    2   N/A  N/A   2647657      C   python                                    15830MiB |
|    3   N/A  N/A   2647657      C   python                                    15830MiB |
|    4   N/A  N/A    225618      C   ...n/miniconda3/envs/alice/bin/python3    40074MiB |
|    4   N/A  N/A   2647657      C   python                                    15830MiB |
|    5   N/A  N/A    190920      C   ...n/miniconda3/envs/alice/bin/python3      554MiB |
|    5   N/A  N/A   2647657      C   python                                    15830MiB |
|    6   N/A  N/A    190920      C   ...n/miniconda3/envs/alice/bin/python3    20610MiB |
|    6   N/A  N/A   1857886      C   python                                    16278MiB |
|    6   N/A  N/A   2647657      C   python                                    15830MiB |
|    7   N/A  N/A    359339      C   ...n/miniconda3/envs/alice/bin/python3    20558MiB |
|    7   N/A  N/A   2647657      C   python                                    15686MiB |
+---------------------------------------------------------------------------------------+

None of the above processes is mine (processes from other users are also running; I am using NVIDIA DGX Server Version 6.1.0 (GNU/Linux 5.15.0-1029-nvidia x86_64)).

@ptrblck do you have any idea about this?

Did you run the previously mentioned use case of allocating a single tensor, making sure it’s created on the device, and continuing with your whole script? If so, did it work or are you seeing the same error?
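
For concreteness, that would look something like the following at the top of train_test.py (placing the check before the project imports is an assumption of this sketch, not something specified above):

# top of train_test.py: allocate a tensor on the GPU first, then continue
import torch

t = torch.randn(1).cuda()
print(t)   # expect something like tensor([...], device='cuda:0')

# ... the original imports and the rest of the training script follow,
# e.g. from pkd.utils import set_random_seed, time_now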

@ptrblck I am not able to allocate the single tensor at the beginning of my script; I see the same error while allocating it. However, I am able to do it in a separate script, as described in the previous reply, and torch.cuda.is_available() is still True in my original script.

Just to make sure I understand the current runs: you are able to allocate a tensor on the GPU using a standalone script, but allocating this tensor at the beginning of your actual training script fails? If so, are you using different Python environments? I cannot explain why the same code would fail otherwise.
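
One quick way to rule out an environment mismatch is to print the interpreter and the torch build from both the standalone smoke test and the training script and compare; a small diagnostic sketch (not something posted in the thread):

import os
import sys
import torch

# run this from both the standalone script and train_test.py and compare
print("python interpreter  :", sys.executable)
print("torch version/path  :", torch.__version__, torch.__file__)
print("torch built for CUDA:", torch.version.cuda)
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))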

@ptrblck I was using the same Python env, but the error was resolved by changing the location of the import torch statement. Thanks for the help!
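
The post does not say where import torch was moved to or which other import was interfering, so the following is only a generic way to narrow that down: print the GPU-related environment around each import and watch for a change (pkd is the project package from the traceback; everything else is illustrative):

import os

def report(tag):
    # watch for an import that silently changes the visible devices
    print(tag, "CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

report("before torch")
import torch
report("after torch ")
from pkd.utils import set_random_seed, time_now   # import from the original traceback
report("after pkd   ")

If the value changes between two reports, the import in between is a likely culprit; a change in the set of visible devices after torch has queued its lazy CUDA calls would also be consistent with the device >= 0 && device < num_gpus assert in the original traceback.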

Which other libraries did you need to load before/after PyTorch to fix this issue? It seems one of the imports might be interfering with the communication with your GPU.