CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

I see this has been posted a few times before, but none of those responses have helped.

From here, I’m trying:

export CUDA_LAUNCH_BLOCKING=1 && export CUDA_VISIBLE_DEVICES=0 && python train_transformer_encoder.py --batch_size 1

But I’m getting:

Traceback (most recent call last):
  File "train_transformer_encoder.py", line 207, in <module>
    main()
  File "train_transformer_encoder.py", line 129, in main
    train_loss = train(reconstruct_spect_model, optimizer=optimizer, dataset=train_loader)
  File "train_transformer_encoder.py", line 194, in train
    pred = model(x)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shamoon/speech-reconstruction/models/transformer_reconstruct.py", line 45, in forward
    src = self.inp_embedding(src)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

So I’ve tried a batch_size of 1 and CUDA_LAUNCH_BLOCKING as well, but no dice. Any help would be greatly appreciated.
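For reference, the same variables can also be set from inside the script; as far as I know they only take effect if set before CUDA is first initialized, so this goes at the very top (a minimal sketch, not my actual training code):

import os

# Must be set before the first CUDA call, hence before importing torch
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # synchronous kernel launches
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

import torch  # CUDA initializes lazily, so the variables above are picked up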

Thank you in advance


Also, I want to add that when I run on CPU, it’s fine.


I created a MUCH simpler example:

import torch

if __name__ == "__main__":
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A single linear layer and one random batch, moved to the GPU if available
    m = torch.nn.Linear(20, 30).to(DEVICE)
    inp = torch.randn(128, 20).to(DEVICE)
    output = m(inp)  # the forward pass is what triggers cuBLAS initialization
    print('output', output.size())

and get the same error.

I am jumping on the bandwagon.

I am getting the same error as @shamoons on the same example. The fix of CUDA_LAUNCH_BLOCKING=1 and CUDA_VISIBLE_DEVICES=0 did nothing. Running on CPU works well.

This operation succeeds:

>>> a = torch.tensor([1]).cuda()
>>> b = torch.rand([1]).cuda()
>>> c = a + b
>>> print(c)
tensor([2], device='cuda:0')

The following throws the original error from this post:

>>> l = torch.nn.Linear(1, 1).cuda()
>>> a = torch.tensor([1.]).cuda()
>>> l(a)

My specs:
PyTorch version: 1.8.0
CUDA version: 11.0
Driver version: 450.102.04
NVIDIA-SMI: 450.102.04
GPU: NVIDIA GeForce RTX 2080 SUPER
OS: Ubuntu 20.04.2
Kernel: 5.8.0-44

PARTIAL FIX:

It seems that downgrading to PyTorch version 1.7.1 fixed the issue for me. This is obviously not ideal, but maybe it can help you too, @shamoons.

It would still be nice to hear from official sources to see if they have something to say :smiley: (thank you a lot torch devs and admins for making our lives better)


Downgrading helped with the simple case, but now my actual code hits the same error. If it matters, I’m using a virtual environment.

Can you share the same specs that I shared for my system above? Hopefully we can figure something out.

Are you on Discord or something, perhaps? We can screen share.

@shamoons @Daniel_Hernandez could you both post the output of python -m torch.utils.collect_env?

@Daniel_Hernandez’s env shows CUDA 11.0, which is not shipped in the 1.8.0 binaries (10.2 and 11.1 are used), so I’m unsure where this package comes from. Did you build it from source?

If you are using the 10.2 pip wheels (not conda), note that sm_75 was pruned from them and you would have to use the 11.1 pip wheels.
This issue is also tracked here.
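As a quick check, something like this shows whether the installed binary ships kernels for your GPU (a small sketch; torch.cuda.get_arch_list should be available in recent releases):

import torch

# Compute capability of the visible GPU, e.g. (7, 5) -> sm_75 (Turing)
major, minor = torch.cuda.get_device_capability(0)
print(f"device architecture: sm_{major}{minor}")

# Architectures the installed binary was compiled for; if your sm_XX
# is missing from this list, the wheel cannot run kernels on your card
print("binary arch list:", torch.cuda.get_arch_list())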

It’s the legendary @ptrblck! Thank you a lot, truly.

I am going to post two different outputs of python -m torch.utils.collect_env: one from my system environment using version 1.7.1, and another using torch 1.8.0 in a virtual environment. Oddly enough, it seems that torch is using a different CUDA version than what’s specified in nvidia-smi! (It’s using 10.2 instead of 11.0.)

Torch version 1.7.1 where the original error message DOES NOT appear:

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 SUPER
Nvidia driver version: 450.102.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.7.1
[conda] Could not collect

Now, the virtualenv 1.8 version where the original error DOES happen:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 SUPER
Nvidia driver version: 450.102.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.0
[conda] Could not collect

Did you build it from source?

No, I installed via pip install torch with pip version 20.0.2 on Python version 3.8.5. The pip version inside my virtualenv is 21.0.1.

I am sorry for the noob question. How might I install the 11.1 pip wheels? (assuming that 11.1 refers to the conda version)

Thanks again!
Dani


That’s expected, since the pip wheels and conda binaries ship with their own CUDA runtime.
Your local CUDA toolkit (shown via e.g. nvidia-smi) would be used if you are building a custom CUDA extension or PyTorch from source.
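For anyone unsure which runtime their binary actually uses, a quick check (sketch):

import torch

# The CUDA runtime the binary was built with -- this is what PyTorch uses
print("torch.version.cuda:", torch.version.cuda)

# nvidia-smi instead reports the driver's maximum supported CUDA version,
# which may legitimately differ from the value above
print("device:", torch.cuda.get_device_name(0))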

Thanks for the env information. Based on this, you are indeed using the pip wheels with the CUDA 10.2 runtime, which are broken on the Turing architecture (see the linked issue).

11.1 refers to the CUDA runtime version. You can install it by selecting CUDA 11.1 here:

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
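
After reinstalling, a sanity check along these lines should confirm the right wheel is active (expected values assume the cu111 build was picked up):

import torch

print(torch.__version__)          # expect: 1.8.0+cu111
print(torch.version.cuda)         # expect: 11.1
print(torch.cuda.is_available())  # expect: True

# The minimal repro from earlier in the thread should now run cleanly
print(torch.nn.Linear(1, 1).cuda()(torch.tensor([1.]).cuda()))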

Oh yeah! It worked! And I’ve now learnt that pip wheels ship with their own CUDA runtime!

Installing torch with CUDA 11.1 with the following command did fix the initial issue with torch 1.8:

 pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

May you have an amazing day! Good luck with your issue, @shamoons. Do try out @ptrblck’s suggestions and hopefully you’ll be set!


I’ve encountered the same error, but with torch 1.10.2 and CUDA 11.3.

The error info is below:

Traceback (most recent call last):
  File "train.py", line 349, in <module>
    main_train()
  File "train.py", line 315, in main_train
    train(rank, args, train_dataset, valid_dataset, model, collator, tokenizer)
  File "train.py", line 199, in train
    loss += train_iter(model, batch, optimizer, scheduler, device)
  File "train.py", line 99, in train_iter
    labels=labels, return_dict=False)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1588, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2370, in forward
    return_dict=return_dict,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2238, in forward
    return_dict=return_dict,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2102, in forward
    use_cache=use_cache,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 1024, in forward
    output_attentions=output_attentions,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 819, in forward
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: /opt/conda/conda-bld/pytorch_1640811805959/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3

The output of python -m torch.utils.collect_env:

PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.9

Python version: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)  [GCC 7.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-65-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
GPU 3: NVIDIA A40

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.2
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit               11.3.1               h2bc3f7f_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl                       2022.0.1           h06a4308_117    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.19.5                    <pip>
[conda] pytorch                   1.10.2          py3.6_cuda11.3_cudnn8.2.0_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] pytorch-mutex             1.0                        cuda    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

I installed torch via conda install pytorch cudatoolkit=11.3

Could you post a minimal, executable code snippet to reproduce this issue, please?

Hey there,
I have encountered the same problem as @skpig.
I ran the simple code given by shamoons (as shown below) on two different GPUs, a Tesla V100-PCIE-32GB and an NVIDIA A100-SXM-80GB:

import torch

if __name__ == "__main__":
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A single linear layer and one random batch, moved to the GPU if available
    m = torch.nn.Linear(20, 30).to(DEVICE)
    inp = torch.randn(128, 20).to(DEVICE)
    output = m(inp)  # the forward pass is what triggers cuBLAS initialization
    print('output', output.size())

The code above runs fine on the Tesla V100, but on the NVIDIA A100 it throws the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I then executed python -m torch.utils.collect_env on both machines to try to find any discrepancies between the two setups.
For the Tesla V100, it gave:

PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Stream release 8 (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-8)
Clang version: 13.0.0 (Red Hat 13.0.0-3.module_el8.6.0+1074+380cef3f)
CMake version: version 3.20.2
Libc version: glibc-2.28

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-358.el8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 470.82.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] numpydoc==1.2
[pip3] pytorch-msssim==0.2.1
[pip3] torch==1.10.2
[pip3] torchaudio==0.10.2
[pip3] torchvision==0.11.3
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] mypy_extensions           0.4.3            py39h06a4308_1  
[conda] numpy                     1.20.3           py39hf144106_0  
[conda] numpy-base                1.20.3           py39h74d4b33_0  
[conda] numpydoc                  1.2                pyhd3eb1b0_0  
[conda] pytorch                   1.10.2          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-msssim            0.2.1                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.10.2               py39_cu113    pytorch
[conda] torchvision               0.11.3               py39_cu113    pytorch

And for NVIDIA A100, it gave:

PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Stream release 8 (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
Clang version: 13.0.0 (Red Hat 13.0.0-3.module_el8.6.0+1074+380cef3f)
CMake version: version 3.20.2
Libc version: glibc-2.28

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-348.el8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM-80GB
GPU 1: NVIDIA A100-SXM-80GB
GPU 2: NVIDIA A100-SXM-80GB
GPU 3: NVIDIA A100-SXM-80GB

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] numpydoc==1.2
[pip3] pytorch-msssim==0.2.1
[pip3] torch==1.10.2
[pip3] torchaudio==0.10.2
[pip3] torchvision==0.11.3
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] mypy_extensions           0.4.3            py39h06a4308_1  
[conda] numpy                     1.20.3           py39hf144106_0  
[conda] numpy-base                1.20.3           py39h74d4b33_0  
[conda] numpydoc                  1.2                pyhd3eb1b0_0  
[conda] pytorch                   1.10.2          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-msssim            0.2.1                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.10.2               py39_cu113    pytorch
[conda] torchvision               0.11.3               py39_cu113    pytorch

I’m wondering what causes this issue and how to solve it. Any help is really appreciated :slight_smile:

It seems the CUBLAS_STATUS_INTERNAL_ERROR changed to CUBLAS_STATUS_NOT_INITIALIZED, which could point towards a setup issue.
Was this setup working in the past, and if so, did you update the drivers without a reboot, etc.?
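
If it helps to narrow things down, a bare matmul outside of any model code is enough to force cuBLAS initialization (a minimal sketch):

import torch

# A plain matmul forces creation of the cuBLAS handle; if this already
# fails, the problem is in the setup/driver, not in the model code
x = torch.randn(2, 2, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("cuBLAS initialized fine:", y.shape)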

I’m not really sure, as this is my first time using an NVIDIA A100 GPU.
If it is a setup issue, I’ll ask my administrator about it. Thanks for the reply!

I am using the GPU. If I run too many instances (more than one training session) on the GPU, could this happen?

I updated via

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

and got a RuntimeError: CUDA error: out of memory.

Posting in case this is helpful to anyone in the future.

Reduce the memory consumption by reducing the batch size or using a lighter architecture. You can also reduce the image size, or the token length if you are working with text.
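
Also worth noting: as far as I understand, cublasCreate itself allocates device memory for its workspace, so a GPU that is already nearly full (e.g. several training sessions sharing one card) can fail at handle creation rather than with a plain out-of-memory error. A small sketch to check headroom from inside the process:

import torch

# Rough view of how much memory this process is using on GPU 0
props = torch.cuda.get_device_properties(0)
print(f"total:     {props.total_memory / 1024**2:.0f} MiB")
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")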