CUDA call failed with torch.cuda.current_device(), A100, PyTorch 2.2.2, CUDA 12.3: what do I do with env variables to use a Jupyter notebook?

Hi,

Apologies if this is solved somewhere; I've googled around but can't figure out what to do.

I have PyTorch 2.2.2 in a conda environment on a Linux machine, installed via

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
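
(As far as I understand, these conda binaries bundle their own CUDA 12.1 runtime, so the driver just has to be new enough; a quick sanity check of what the install reports, using the standard torch version attributes:)

>>> import torch
>>> torch.__version__                # '2.2.2'
>>> torch.version.cuda               # CUDA runtime the binary was built against, e.g. '12.1'
>>> torch.backends.cudnn.version()   # bundled cuDNN build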

This is my nvidia-smi output:

Tue Apr  9 03:35:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:31:00.0 Off |                   On |
| N/A   67C    P0             252W / 300W |    791MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:4B:00.0 Off |                   On |
| N/A   67C    P0             251W / 300W |    739MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off | 00000000:B1:00.0 Off |                   On |
| N/A   34C    P0              45W / 300W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off | 00000000:CA:00.0 Off |                   On |
| N/A   34C    P0              42W / 300W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |             751MiB / 81050MiB  | 98      0 |  7   0    5    1    1 |
|                  |               5MiB / 131072MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    0   0   0  |             725MiB / 81050MiB  | 98      0 |  7   0    5    1    1 |
|                  |               5MiB / 131072MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    7   0   0  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    8   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   11   0   4  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   12   0   5  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   13   0   6  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    7   0   0  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    8   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   11   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   12   0   4  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   13   0   5  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   14   0   6  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    0    0    1877752      C   ...odinger2024-1/internal/bin/gdesmond      686MiB |
|    1    0    0    1877801      C   ...odinger2024-1/internal/bin/gdesmond      686MiB |
+---------------------------------------------------------------------------------------+

In my Python interpreter I get:

>>> torch.__version__
'2.2.2'
>>> torch.cuda.is_available()
True

But, if I run:

>>> torch.cuda.current_device()
# long traceback
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1711403380481/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
# long traceback
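
(For context, a small diagnostic along these lines shows what the process can actually enumerate; the failing index in the error suggests PyTorch is asked for device 1 while it sees fewer devices than that. Just a sketch:)

>>> import os, torch
>>> os.environ.get("CUDA_VISIBLE_DEVICES")   # whether anything is set for this process
>>> torch.cuda.device_count()                # how many devices CUDA exposes to this process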

I will put the full error below. I am not sure what to do here. I used to run some models with TensorFlow that required me to do something like

export CUDA_VISIBLE_DEVICES=MIG-e343b8ed-9408-52cc-90b7-683cf8ec07f0

before I could run any jobs, usually in a format like this inside a .sh script:

export CUDA_VISIBLE_DEVICES=MIG-8ad5411d-a2b7-5a4c-9d7e-d973c6c03b08
source activate dd-nvi
python -u progressive_docking_2.py 
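
(As an aside, the MIG UUIDs I paste into those exports come from nvidia-smi -L; a minimal way to pull them from Python, assuming nvidia-smi is on the PATH:)

import subprocess

# "nvidia-smi -L" lists every GPU and every MIG device together with its UUID,
# e.g. "  MIG 1g.10gb     Device  0: (UUID: MIG-xxxxxxxx-....)"
out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
mig_uuids = [line.split("UUID: ")[1].rstrip(")")
             for line in out.stdout.splitlines() if "MIG-" in line]
print(mig_uuids)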

I tried to do the same thing here; in the terminal I ran:

export CUDA_VISIBLE_DEVICES=MIG-e343b8ed-9408-52cc-90b7-683cf8ec07f0

Then echoed it back:

echo $CUDA_VISIBLE_DEVICES
MIG-e343b8ed-9408-52cc-90b7-683cf8ec07f0

Now I can run:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'NVIDIA A100 80GB PCIe MIG 1g.10gb'

This all looks correct. I can now move a tensor to device:

>>> X_train = torch.FloatTensor([0., 1., 2.])
>>> X_train = X_train.to("cuda")
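
(And a tiny computation as a sanity check that kernels actually run on the MIG slice:)

>>> y = X_train * 2
>>> y.device      # expect device(type='cuda', index=0)
>>> y.cpu()       # expect tensor([0., 2., 4.])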

However, I can't seem to actually use this environment in a Jupyter notebook on this server, because I get the same CUDA call failed error. I've tried setting it like this in a cell:

!export CUDA_VISIBLE_DEVICES=MIG-e343b8ed-9408-52cc-90b7-683cf8ec07f0

But it doesn't seem to do anything: I can't echo the variable back afterwards (echoing it next to a test string just prints a space where the value should be, indicating $CUDA_VISIBLE_DEVICES is empty).

So when I get to the part of my code that does something with CUDA, it fails with the same CUDA call failed error.

(edit: I can only embed one image, so I can't share a screenshot of that part; this is what I am doing):

X_train = X_train.to(device)
X_test = X_test.to(device)
y_train = y_train.float().to(device)
y_test = y_test.float().to(device)

RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1711403380481/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=

The above exception was the direct cause of the following exception:

DeferredCudaCallError                     Traceback (most recent call last)
Cell In[27], line 3
      1 # %%
----> 3 X_train = X_train.to(device)
      4 X_test = X_test.to(device)
      5 y_train = y_train.float().to(device)

I am really stuck on what I am supposed to do here. I tried to be thorough and followed a number of tutorials before getting to this point, and my models do run (slowly) on my local Mac using MPS. I've gotten TensorFlow to work on this server before, so I have a little experience, but I am out of ideas.

You could either export CUDA_VISIBLE_DEVICES pointing to the desired MIG slice globally (i.e. in the environment the Jupyter kernel is launched from) or disable MIG so PyTorch can see all devices.
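
For the notebook, a minimal sketch of the first option is to set the variable inside the kernel process itself, before anything initializes CUDA (reusing the MIG UUID from above; substitute your own):

import os

# Must happen before the first CUDA call (safest: before importing torch),
# because the visible-device list is read once when CUDA initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-e343b8ed-9408-52cc-90b7-683cf8ec07f0"

import torch
print(torch.cuda.is_available())        # expect True
print(torch.cuda.get_device_name(0))    # expect the MIG 1g.10gb slice

This works where !export does not because a ! command runs in a throwaway subshell, while os.environ (or the %env CUDA_VISIBLE_DEVICES=... magic) changes the environment of the kernel process itself. Alternatively, export the variable in the shell before launching Jupyter so every kernel inherits it.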