RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Hi,
I'm getting "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)".
My code runs fine on another machine, but I get this error when running it on a new machine. I thought both machines had the same GPUs (see the update below).
############### Update:
1st machine GPUs: 2080 Ti
2nd machine (error) GPUs: 1080 Ti
I found that CUDA 11.3 was not compatible with this setup, so I downgraded to CUDA 10.1 and the problem was solved.

If you’ve installed the PyTorch 1.8.1 pip wheels with CUDA 11.1 and are using a Pascal GPU (sm_61), you might be hitting this issue.
So far we were able to isolate it to the library splitting and most likely a failure in the kernel lookup.
As a workaround you could install the conda binaries instead, or the pip/conda binaries with CUDA 10.2.
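To double-check whether a given binary ships kernels for your GPU, a quick sanity check along these lines might help (a minimal sketch; the printed values of course depend on your local install):

    # Minimal sketch: verify that the installed PyTorch binary was built with
    # kernels for this GPU's compute capability (e.g. sm_61 for Pascal).
    import torch

    print(torch.__version__)              # e.g. 1.8.1+cu111
    print(torch.version.cuda)             # CUDA version the binary was built with
    print(torch.cuda.get_device_name(0))
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: sm_{major}{minor}")
    print(torch.cuda.get_arch_list())     # architectures compiled into this binary

If the reported compute capability is not covered by the arch list, the binary won't have native kernels for this GPU, and switching to one of the other binaries mentioned above is the safer bet.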

Hi, I’ve encountered the same issue when running the following snippet of code:

    output = tt_embeddings.tt_forward(
        batch_count,
        B,
        D,
        tt_p_shapes,
        tt_q_shapes,
        tt_ranks,
        L,
        nnz_tt,
        indices,
        rowidx,
        list(ctx.tt_cores),
    )

And the error message & trace I’m getting is the following:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 535, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 607, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 438, in apply_emb
    V = E(sparse_index_group_batch,sparse_offset)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 821, in forward
    *(self.tt_cores),
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 185, in forward
    list(ctx.tt_cores),
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at ../aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fe97ea4999b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x29a19dd (0x7fe843ee49dd in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xd86 (0x7fe843ee5b36 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: tt_embeddings_forward_cuda(int, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, at::Tensor, int, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x840 (0x7fe830c19cf0 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1ea49 (0x7fe830c0da49 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x19f87 (0x7fe830c08f87 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_apply(_object*, _object*) + 0x986 (0x7fe94b8ff216 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

I’m wondering if you could give me some suggestions as to which part went wrong.

I’m using CUDA 10.2 with PyTorch 1.6. My GPU is a Tesla V100-SXM2-32GB.

Thanks!

Could you check if you are running out of memory and, if so, reduce e.g. the batch size of the workload?
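For reference, a minimal sketch for inspecting GPU memory around the failing call (note that cuBLAS creates its handle and workspace outside of PyTorch's caching allocator, so it's worth keeping an eye on nvidia-smi as well):

    # Minimal sketch: print PyTorch's view of GPU memory before/after the
    # suspected call to see how close you are to the device limit.
    import torch

    def print_mem(tag=""):
        dev = torch.cuda.current_device()
        total = torch.cuda.get_device_properties(dev).total_memory
        print(f"[{tag}] allocated={torch.cuda.memory_allocated(dev) / 1e9:.2f} GB, "
              f"reserved={torch.cuda.memory_reserved(dev) / 1e9:.2f} GB, "
              f"total={total / 1e9:.2f} GB")

    print_mem("before forward")
    # ... run the forward pass here ...
    print_mem("after forward")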

Hi,
I have a somewhat similar problem and was wondering if you could help me.
I had the same runtime error, reran my code with CUDA_LAUNCH_BLOCKING=1, and here is the output:

Reading config from config_wn18.yaml

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeats for many more blocks and threads ...]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
Traceback (most recent call last):
  File "train.py", line 424, in <module>
    model.fit()
  File "train.py", line 389, in fit
    train_loss = self.run_epoch(epoch, val_mrr)
  File "train.py", line 356, in run_epoch
    pred = self.model.forward(sub, rel)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 182, in forward
    emb_h = self._projection(sub_emb, h_m, r_m)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 64, in _projection
    a = torch.sum(emb_e * emb_m, axis=-1, keepdims=True)
RuntimeError: CUDA error: device-side assert triggered

This error is raised by an invalid indexing operation, so you could check the shapes and values of the tensors used in the indexing.
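The srcIndex < srcSelectDimSize assert usually means an index is larger than the size of the dimension it indexes into, e.g. an embedding lookup receiving an index >= num_embeddings. A minimal sketch of the kind of check that usually pinpoints it (num_entities and sub are hypothetical placeholders for your embedding table and index tensor):

    # Minimal sketch: validate indices on the CPU, where out-of-range values
    # raise a readable error instead of a device-side assert.
    import torch
    import torch.nn as nn

    num_entities = 100                    # hypothetical vocabulary size
    emb = nn.Embedding(num_entities, 200)

    sub = torch.tensor([3, 17, 100])      # 100 is out of range on purpose
    assert sub.min() >= 0 and sub.max() < emb.num_embeddings, (
        f"index out of range: min={sub.min()}, max={sub.max()}, "
        f"num_embeddings={emb.num_embeddings}"
    )
    out = emb(sub)                        # only reached if the check passes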

Hi,
I'm facing a similar problem and was wondering if someone could help.
The runtime error says RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle):

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "", line 56, in forward
    x_scores = self.x_head(input_embeds)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

You are most likely running out of memory, so you would need to reduce the memory usage, e.g. by decreasing the batch size.

Hello, I have a similar issue. I'm using BertForSequenceClassification and getting RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`. I see the same error with a batch size of 1, and I do not have any errors on the CPU. I'm running PyTorch 1.11.0 and cudatoolkit 11.3.1 on a Tesla T4. Do you have any suggestions for how to fix this? Thanks!

It seems you are also running out of memory and would need to reduce the memory usage, e.g. by using a smaller model or by trading compute for memory via torch.utils.checkpoint.
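In case it helps, a minimal sketch of activation checkpointing (the layers and shapes below are placeholders, not your BERT model):

    # Minimal sketch: trade compute for memory with torch.utils.checkpoint.
    # The checkpointed block does not store its intermediate activations during
    # the forward pass; they are recomputed during the backward pass instead.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)).cuda()
    x = torch.randn(8, 768, device="cuda", requires_grad=True)

    out = checkpoint(block, x)  # activations inside `block` are recomputed in backward
    out.sum().backward()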

I have run into the same CUBLAS_STATUS_ALLOC_FAILED error.
I was running 4-node distributed training across many GPU nodes (20 in total). With batch-size=12 I got CUBLAS_STATUS_ALLOC_FAILED on some of these nodes.

  1. If I set CUDA_LAUNCH_BLOCKING=1, it runs without any error (but the performance drops).
  2. If I set the batch-size to 6, it runs without any error.
  3. On some nodes a large batch-size=16 runs without any error, but on other nodes the batch-size can only be 6 (a larger batch-size causes CUBLAS_STATUS_ALLOC_FAILED).
  4. When I use standalone (single-node) training with a large batch-size=16, all nodes run without any error.

All my nodes have the same settings: driver and CUDA/cuDNN versions, the same GPUs (V100 with 32GB of memory), the same OS (CentOS), and the same disks, CPUs, and memory.

I am using CUDA 11.5 and torch-1.12.1+cu115 with NCCL 2.10; distributed training uses the NCCL backend over InfiniBand. The job trains a Megatron GPT model.

How can I solve this error? Why can some nodes handle a larger batch-size without any error while the others can't? And why does setting CUDA_LAUNCH_BLOCKING avoid the error (at a performance cost I cannot accept)?

That’s unclear and you should try to debug this artifact, e.g. by checking if some ranks use larger inputs or create additional objects. It could also be worth checking if your script is creating unnecessary additional CUDA contexts on some devices, which would also waste memory.
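As a starting point for the context question, this is the usual pattern that avoids stray contexts on GPU 0 (a minimal sketch assuming a torchrun-style launcher that sets LOCAL_RANK; nvidia-smi should then show exactly one process per GPU):

    # Minimal sketch: pin each rank to its own GPU before any CUDA work, so no
    # extra CUDA contexts (several hundred MB each) are created on device 0.
    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)            # pin this process to its GPU
    dist.init_process_group(backend="nccl")

    device = torch.device("cuda", local_rank)
    # Build the model and tensors on `device` only; a bare .cuda() defaults to GPU 0.
    print(f"rank {dist.get_rank()} uses {torch.cuda.get_device_name(device)}")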

@ptrblck
Hello, I am facing a similar problem and created a topic for it; I would appreciate your help:
https://discuss.pytorch.org/t/facebook-bart-fine-tuning-transformers-cuda-error-cublas-status-not-initialize/178641