RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Could you make sure you are not running out of memory (in which case cuBLAS would fail to allocate its internal workspace), e.g. by lowering the batch size?
If that doesn’t help, could you post your setup via `python -m torch.utils.collect_env` as well as an executable code snippet to reproduce this issue, please?
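For a quick check, something like this should show whether you are close to the memory limit right before the failing call (a minimal sketch using the allocator statistics):

    import torch

    # Minimal sketch: print allocator statistics right before the op that fails,
    # to see whether the GPU is close to its memory limit.
    device = torch.device("cuda")
    print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**2:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / 1024**2:.1f} MiB")
    print(torch.cuda.memory_summary(device, abbreviated=True))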

Hi @ptrblck,
Thank you for the quick response!

  1. To check for OOM I tried:
    1.1) Reducing the batch_size to 1 (got the same error).
    1.2) Using a smaller pretrained model (fewer parameters) with batch_size equal to 1 (got the same error).

  2. I’ve been mostly experimenting on Colab; will a link to the notebook work?

I’ve run into a similar issue but I’m out of ideas (on AWS with a g4dn.2xlarge instance). Almost identical code I had written earlier seemed to work fine.


I also tried running with a batch size of 1, but it still fails. PS: This code works completely fine when not using a GPU.

FIX: For some reason this was an issue with PyTorch 1.8.0. I looked at the post “RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` while running fine on the CPU - #13 by ptrblck”, downgraded PyTorch, and it worked fine.

Hi @ptrblck,
I’d like to take a shot at debugging this issue by myself.
Would you mind providing some guidance on how to confirm whether this is an issue with PyTorch itself, and which part of PyTorch I should start looking at to figure out the cause?

I came across the same problem 2 days ago. Using pytorch 1.8.0 on my machine caused the same error, while using it on another machine worked fine. On my machine I was using the pre-compiled version of pytorch (via pip); on the other machine I compiled pytorch myself with cuda 11.1.
I don’t know why the error occurs, but I solved it by downgrading torch to 1.7.0.

If you are using a Turing GPU, try out the nightly binary, which should fix the missing sm_75 issue as described here and here.
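To confirm whether the installed binary ships kernels for your GPU, you could compare the device’s compute capability against the compiled architecture list, e.g. (a rough sketch; torch.cuda.get_arch_list() is available in the recent binaries):

    import torch

    # Rough sketch: check whether the installed binary was built for this GPU.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    print(f"compiled for: {torch.cuda.get_arch_list()}")  # a Turing GPU needs 'sm_75' in this list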
CC @chatuur


Hello, I also ran into the same problem during testing. I get the correct result when using the CPU, but when using the GPU the error message below is raised. Is it a CUDA version problem? Thank you!

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasCreate(handle)


You might be hitting the previously mentioned error. Did you check the linked posts and try installing the nightly?

PS: could you use an online translator before posting the message, please? :)

Hello,

I am facing exactly the same error while trying to run the code on 2 x NVIDIA Tesla K40 using pytorch’s DataParallel().

My setup is: pytorch 1.7.0, cuda 10.1, python 3.7.6

The same code runs fine on 1 GPU. I also tried setting CUDA_LAUNCH_BLOCKING=1, but then the code just gets stuck.

Thank you!

Hi,
I have "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)".
My code is running on another machine. I have this when run on a new machine. Two machines have the same GPUs.
###############Update:
1st machine gpus = 2080Ti
2nd machine (error) gpus = 1080 Ti.
Found Cuda version = 11.3 is not compatible. Then I down cuda version to 10.1. Problem solved.

If you’ve installed the PyTorch 1.8.1 pip wheels with CUDA 11.1 and are using a Pascal GPU (sm_61), you might be hitting this issue.
So far we were able to isolate it to the library splitting and most likely a failure in the kernel lookup.
As a workaround you could install the conda binaries instead, or the pip/conda binaries with CUDA 10.2.
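To check whether you are on the affected combination, printing the wheel tag and the device’s compute capability is usually enough (a quick sketch; a '+cu111' suffix indicates the pip wheel built with CUDA 11.1):

    import torch

    # Quick sketch: identify the installed binary and the GPU architecture.
    print(torch.__version__)                    # e.g. '1.8.1+cu111' for the affected pip wheel
    print(torch.version.cuda)                   # CUDA version the binary was built with
    print(torch.cuda.get_device_capability(0))  # (6, 1) corresponds to a Pascal sm_61 GPU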


Hi, I’ve encountered the same issue when running the following snippet of code:

    output = tt_embeddings.tt_forward(
        batch_count,
        B,
        D,
        tt_p_shapes,
        tt_q_shapes,
        tt_ranks,
        L,
        nnz_tt,
        indices,
        rowidx,
        list(ctx.tt_cores),
    )

And the error message & trace I’m getting is the following:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 535, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 607, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 438, in apply_emb
    V = E(sparse_index_group_batch,sparse_offset)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 821, in forward
    *(self.tt_cores),
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 185, in forward
    list(ctx.tt_cores),
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at ../aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fe97ea4999b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x29a19dd (0x7fe843ee49dd in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xd86 (0x7fe843ee5b36 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: tt_embeddings_forward_cuda(int, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, at::Tensor, int, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x840 (0x7fe830c19cf0 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1ea49 (0x7fe830c0da49 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x19f87 (0x7fe830c08f87 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_apply(_object*, _object*) + 0x986 (0x7fe94b8ff216 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

I’m wondering if you could give me some suggestions as to which part went wrong.

I’m using CUDA 10.2 with PyTorch 1.6. My GPU is a Tesla V100-SXM2-32GB.

Thanks!

Could you check if you are running out of memory and, if so, reduce e.g. the batch size of the workload?
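If it is indeed OOM, a common pattern is to catch the error and retry with a smaller batch, e.g. (a minimal sketch; run_step is a hypothetical callable executing one training step for a given batch size, and the OOM surfaces as a RuntimeError whose message contains “out of memory”):

    import torch

    def find_working_batch_size(run_step, batch_size):
        # Minimal sketch: halve the batch size until one step fits into GPU memory.
        while batch_size >= 1:
            try:
                run_step(batch_size)
                return batch_size
            except RuntimeError as e:
                if "out of memory" not in str(e).lower():
                    raise
                torch.cuda.empty_cache()  # release cached blocks before retrying
                batch_size //= 2
        raise RuntimeError("even batch_size=1 does not fit into GPU memory")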

Hi,
I have a somewhat similar problem and was wondering if you could help me.
I had the same runtime error, ran my code with CUDA_LAUNCH_BLOCKING=1, and here is the output:

Reading config from config_wn18.yaml

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [81,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [85,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [86,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [87,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [88,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [89,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [90,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [91,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
Traceback (most recent call last):
  File "train.py", line 424, in <module>
    model.fit()
  File "train.py", line 389, in fit
    train_loss = self.run_epoch(epoch, val_mrr)
  File "train.py", line 356, in run_epoch
    pred = self.model.forward(sub, rel)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 182, in forward
    emb_h = self._projection(sub_emb, h_m, r_m)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 64, in _projection
    a = torch.sum(emb_e * emb_m, axis=-1, keepdims=True)
RuntimeError: CUDA error: device-side assert triggered

This error is raised by an invalid indexing operation, so you could check the shapes and values of the tensors in the indexing operation.
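For the embedding/index_select case a quick check along these lines usually pinpoints the offending lookup (a sketch; emb and idx stand for your embedding module and index tensor):

    import torch

    def check_indices(emb: torch.nn.Embedding, idx: torch.Tensor) -> None:
        # Sketch: validate the indices on the host before the CUDA kernel asserts.
        bad = (idx < 0) | (idx >= emb.num_embeddings)
        if bad.any():
            raise ValueError(
                f"{int(bad.sum())} indices outside [0, {emb.num_embeddings}): "
                f"min={int(idx.min())}, max={int(idx.max())}"
            )

Alternatively, running the same batch on the CPU usually raises a readable "index out of range" error pointing to the failing line.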


Hi,
I’m facing a similar problem and was wondering if someone could help.
The runtime error says RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "", line 56, in forward
    x_scores = self.x_head(input_embeds)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

You are most likely running out of memory, so you would need to reduce the memory usage, e.g. by decreasing the batch size.

Hello, I have a similar issue. I’m using BertForSequenceClassification and getting RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`. I see the same error with a batch size of 1, and I do not have any errors on the CPU. I’m running PyTorch 1.11.0 and cudatoolkit 11.3.1 on a Tesla T4. Do you have any suggestions for how to fix this? Thanks!

It seems you are also running out of memory and would need to reduce the memory usage e.g. by using a smaller model or by trading compute for memory via torch.utils.checkpoint.
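For reference, a minimal sketch of trading compute for memory with torch.utils.checkpoint (the encoder/classifier split here is made up; wrap whichever large submodule dominates the activation memory):

    import torch
    from torch.utils.checkpoint import checkpoint

    class CheckpointedModel(torch.nn.Module):
        # Minimal sketch: recompute the encoder activations in the backward pass
        # instead of storing them, trading compute for memory.
        def __init__(self, encoder: torch.nn.Module, classifier: torch.nn.Module):
            super().__init__()
            self.encoder = encoder        # hypothetical large submodule (e.g. the BERT encoder)
            self.classifier = classifier  # hypothetical small head

        def forward(self, x):
            h = checkpoint(self.encoder, x)  # activations are recomputed during backward
            return self.classifier(h)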

I have run into the same CUBLAS_STATUS_ALLOC_FAILED error.
I was running 4-node distributed training jobs across many GPU nodes (20 in total). With batch-size=12, I got CUBLAS_STATUS_ALLOC_FAILED on some of these nodes.

  1. If I set CUDA_LAUNCH_BLOCKING=1, it runs without any error (but performance drops).
  2. If I set the batch size to 6, it runs without any error.
  3. On some nodes a large batch size of 16 runs without any error, but on other nodes the batch size can only be 6 (a larger batch size causes CUBLAS_STATUS_ALLOC_FAILED).
  4. With standalone (single-node) training and a large batch size of 16, all nodes run without any error.

All my nodes have the same settings, including driver and cuda/cudnn version, the same GPUs (V100 with 32GB memory), the same OS (CentOS), and the same disk, CPU, and memory.

I am using cuda 11.5 and torch-1.12.1+cu115 with nccl 2.10; distributed training uses the NCCL backend over IB. The job trains a Megatron GPT model.

How can I solve the error? Why can some nodes handle a larger batch size without any error while other nodes can’t? And why does CUDA_LAUNCH_BLOCKING avoid the error (but with a performance drop that I cannot accept)?
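For comparing the nodes, this is roughly how one could log free/total GPU memory per rank at startup (a sketch; assumes torch.distributed is already initialized and torch.cuda.mem_get_info is available in your release):

    import torch
    import torch.distributed as dist

    def log_gpu_memory_per_rank() -> None:
        # Sketch: print free/total device memory for every rank so nodes can be compared.
        rank = dist.get_rank() if dist.is_initialized() else 0
        free_b, total_b = torch.cuda.mem_get_info()
        print(f"rank {rank}: {free_b / 1024**3:.1f} GiB free of {total_b / 1024**3:.1f} GiB")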