RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

But the script only reads small images (just 65x25) one by one, so how could 2GB of memory not be enough?

Images like this:
[example images were attached in the original post]

The GPU memory is needed for the CUDA context (which contains the runtime, kernels, etc.), the inputs, the model parameters, the intermediate forward activations (during training), the gradients, and the optimizer's running estimates (in case you are using such an optimizer). So while the image size by itself might fit, the overall training might not.

You could run a quick check with a tiny model (e.g. two linear layers) and see what the maximum workload size would be.
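For example, a rough way to see how much memory even a tiny setup already needs (a minimal sketch; the layer sizes, batch size, and optimizer are placeholders):

import torch
import torch.nn as nn

device = torch.device("cuda")

# Tiny two-linear-layer model, to measure the overhead of parameters,
# activations, gradients, and optimizer state.
model = nn.Sequential(nn.Linear(65 * 25, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(64, 65 * 25, device=device)  # a batch of flattened 65x25 "images"
model(x).mean().backward()
optimizer.step()

print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MB")

Note that nvidia-smi will report more than these numbers, since the CUDA context itself is not tracked by PyTorch's allocator.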

Hi, I’m getting the same error.
I’m using GPT2ForSequenceClassification from transformers by Hugging Face.

import torch
print(torch.__version__)
# out: 1.7.1+cu101

import transformers
print(transformers.__version__)
# out: 4.3.3

! nvidia-smi
# out: NVIDIA-SMI 460.39    Driver Version: 460.32.03    CUDA Version: 11.2

I tried running this on colab as well as on another machine with higher RAM.
This is the stack trace:

RuntimeError                              Traceback (most recent call last)

<ipython-input-19-3435b262f1ae> in <module>()
----> 1 trainer.train()

12 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    938                         tr_loss += self.training_step(model, inputs)
    939                 else:
--> 940                     tr_loss += self.training_step(model, inputs)
    941                 self._total_flos += self.floating_point_ops(inputs)
    942 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
   1302                 loss = self.compute_loss(model, inputs)
   1303         else:
-> 1304             loss = self.compute_loss(model, inputs)
   1305 
   1306         if self.args.n_gpu > 1:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1332         else:
   1333             labels = None
-> 1334         outputs = model(**inputs)
   1335         # Save past state if it exists
   1336         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1206             output_attentions=output_attentions,
   1207             output_hidden_states=output_hidden_states,
-> 1208             return_dict=return_dict,
   1209         )
   1210         hidden_states = transformer_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    753                     encoder_attention_mask=encoder_attention_mask,
    754                     use_cache=use_cache,
--> 755                     output_attentions=output_attentions,
    756                 )
    757 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    293             head_mask=head_mask,
    294             use_cache=use_cache,
--> 295             output_attentions=output_attentions,
    296         )
    297         attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    223             attention_mask = encoder_attention_mask
    224         else:
--> 225             query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
    226 
    227         query = self.split_heads(query)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in forward(self, x)
   1204     def forward(self, x):
   1205         size_out = x.size()[:-1] + (self.nf,)
-> 1206         x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
   1207         x = x.view(*size_out)
   1208         return x

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

 x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)

From the stack trace, this is the line that’s causing the error. This line belongs to the following class:

class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

    Basically works like a linear layer but the weights are transposed.

    Args:
        nf (:obj:`int`): The number of output features.
        nx (:obj:`int`): The number of input features.
    """

    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf
        w = torch.empty(nx, nf)
        nn.init.normal_(w, std=0.02)
        self.weight = nn.Parameter(w)
        self.bias = nn.Parameter(torch.zeros(nf))

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        x = x.view(*size_out)
        return x

I tried creating a random tensor and passing it to the Conv1D class, but that ran fine. Not sure if that helps in narrowing down where the problem is:

nx = 768
n_state = nx
conv = Conv1D(3 * n_state, nx)  # local copy of the Conv1D class shown above
hidden_states = torch.randn([16, 1024, 768])
conv(hidden_states)  # this runs fine (note: module and tensor are both on the CPU here)
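Since both the module and the input live on the CPU in that test, it never exercises cuBLAS at all. To reproduce the failing path, the same test would have to run on the GPU; a minimal sketch (assuming transformers 4.3.x, where Conv1D is importable from transformers.modeling_utils, and that a CUDA device is available):

import torch
from transformers.modeling_utils import Conv1D  # same class as in the stack trace

device = torch.device("cuda")
nx = 768
n_state = nx

conv = Conv1D(3 * n_state, nx).to(device)
hidden_states = torch.randn(16, 1024, 768, device=device)

out = conv(hidden_states)   # the addmm inside now goes through cuBLAS
torch.cuda.synchronize()    # force asynchronous CUDA errors to surface at this line
print(out.shape)            # expected: torch.Size([16, 1024, 2304])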

Could you make sure that you are not running out of memory and cuBLAS is failing to allocate some internal memory, e.g. by lowering the batch size?
If that doesn’t help, could you post your setup via python -m torch.utils.collect_env as well as an executable code snippet to reproduce this issue, please?

Hi @ptrblck,
Thank you for the quick response!

1. To check for OOM I tried:
    1.1) Reducing the batch_size to 1 (got the same error).
    1.2) Using a smaller pretrained model (fewer parameters) with batch_size equal to 1 (got the same error).

2. I’ve been mostly experimenting on Colab; will a link to the notebook work?

I’ve run into a similar issue, but I’m out of ideas (on AWS with a g4dn.2xlarge instance). Almost identical code that I had seemed to work fine.


I also tried running with a batch size of 1, but it still fails. PS: This code works completely fine when not using a GPU.

FIX: For some reason this was an issue with PyTorch 1.8.0. I looked at this post RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` while running fine on the CPU - #13 by ptrblck, tried downgrading PyTorch, and it worked fine.

Hi @ptrblck,
I’d like to take a shot at debugging this issue myself.
Would you mind providing some guidance on how to confirm whether this is an issue with PyTorch itself, and which part of PyTorch I should start looking at to figure out the cause?

I came across the same problem two days ago. Using PyTorch 1.8.0 on my machine caused the same error, while using it on another machine works fine. On my machine I was using the pre-compiled version of PyTorch (via pip); on the other machine I compiled PyTorch myself with CUDA 11.1.
I don’t know why the error occurs, but I solved it by downgrading torch to 1.7.0.

If you are using a Turing GPU, try out the nightly binary, which should fix the missing sm_75 issue as described here and here.
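To check which architectures your binary ships versus what your GPU reports, a quick sanity check (torch.cuda.get_arch_list() is available in recent PyTorch releases):

import torch

print(torch.__version__, torch.version.cuda)    # PyTorch build and its CUDA version
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))      # e.g. (7, 5) for a Turing GPU (sm_75)
print(torch.cuda.get_arch_list())               # compute capabilities compiled into this binary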
CC @chatuur


Hello, I also encountered the same problem during testing: I get the correct result when using the CPU, but when using the GPU the error below is reported. Is it a CUDA version problem? Thank you! (Translated from Chinese.)

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasCreate(handle)

You might be hitting the previously mentioned error. Did you check the posts and try to install the nightly?

PS: Could you use an online translator before posting the message, please? 🙂

Hello,

I am facing exactly the same error while trying to run the code on 2 x NVIDIA Tesla K40 using PyTorch’s DataParallel().

My setup is: PyTorch 1.7.0, CUDA 10.1, Python 3.7.6.

The same code runs on 1 GPU. I also tried setting CUDA_LAUNCH_BLOCKING=1, but then the code just hangs.
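For reference, CUDA_LAUNCH_BLOCKING only takes effect if it is set before the CUDA context is created, so it is usually exported in the shell or set at the very top of the script; a minimal sketch:

import os
# Must happen before the first CUDA call; alternatively run the script as
#   CUDA_LAUNCH_BLOCKING=1 python train.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
x = torch.randn(8, 8, device="cuda")
y = x @ x  # with blocking launches, a failing kernel raises on this line rather than later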

Thank you!

Hi,
I got "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)".
My code runs fine on another machine; I get this error on a new machine. I originally thought the two machines had the same GPUs.
Update:
1st machine GPUs: 2080 Ti
2nd machine (error) GPUs: 1080 Ti
Found that CUDA version 11.3 is not compatible. After downgrading CUDA to 10.1, the problem was solved.

If you’ve installed the PyTorch 1.8.1 pip wheels with CUDA 11.1 and are using a Pascal GPU (sm_61), you might be hitting this issue.
So far we were able to isolate it to the library splitting and, most likely, a failure in the kernel lookup.
As a workaround you could install the conda binaries instead, or the pip/conda binaries with CUDA 10.2.


Hi, I’ve encountered the same issue when running the following snippet of code:

output = tt_embeddings.tt_forward(
    batch_count,
    B,
    D,
    tt_p_shapes,
    tt_q_shapes,
    tt_ranks,
    L,
    nnz_tt,
    indices,
    rowidx,
    list(ctx.tt_cores),
)

And the error message & trace I’m getting is the following:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 535, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 607, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 438, in apply_emb
    V = E(sparse_index_group_batch,sparse_offset)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 821, in forward
    *(self.tt_cores),
  File "/mnt/dlrm_ttrec/tt_embeddings_ops.py", line 185, in forward
    list(ctx.tt_cores),
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at ../aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fe97ea4999b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x29a19dd (0x7fe843ee49dd in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xd86 (0x7fe843ee5b36 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: tt_embeddings_forward_cuda(int, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, at::Tensor, int, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x840 (0x7fe830c19cf0 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1ea49 (0x7fe830c0da49 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x19f87 (0x7fe830c08f87 in /opt/conda/lib/python3.6/site-packages/tt_embeddings-0.0.0-py3.6-linux-x86_64.egg/tt_embeddings.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_apply(_object*, _object*) + 0x986 (0x7fe94b8ff216 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

I’m wondering if you could give me some suggestions as to which part went wrong.

I’m using CUDA 10.2 with PyTorch 1.6. My GPU is a Tesla V100-SXM2-32GB.

Thanks!

Could you check if you are running out of memory and, if so, reduce e.g. the batch size of the workload?
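If nvidia-smi alone doesn’t make it obvious, PyTorch’s allocator statistics can show how close the workload is to the device limit; a minimal sketch (printed right before the failing call):

import torch

# Allocator statistics just before the failing op; compare against the total
# device memory reported by nvidia-smi.
print(torch.cuda.memory_summary())
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MB")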

Hi,
I have a somewhat similar problem; I was wondering if you could help me.
I had the same runtime error, ran my code with CUDA_LAUNCH_BLOCKING=1, and here is the output:

Reading config from config_wn18.yaml

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [59,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
[... the same assertion is repeated for many more blocks and threads ...]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [46,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
torch.Size([128, 200])
Traceback (most recent call last):
  File "train.py", line 424, in <module>
    model.fit()
  File "train.py", line 389, in fit
    train_loss = self.run_epoch(epoch, val_mrr)
  File "train.py", line 356, in run_epoch
    pred = self.model.forward(sub, rel)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 182, in forward
    emb_h = self._projection(sub_emb, h_m, r_m)
  File "/content/drive/My Drive/CIPL/CompGCN-TransD/model/models.py", line 64, in _projection
    a = torch.sum(emb_e * emb_m, axis=-1, keepdims=True)
RuntimeError: CUDA error: device-side assert triggered

This error is raised by an invalid indexing operation, so you could check the shapes and values of the tensors in the indexing operation.
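For example, if the failing op is an embedding lookup (which the indexSelectLargeIndex assert suggests), a check along these lines will show whether the indices exceed the table size; emb and idx are placeholders for your embedding module and index tensor:

import torch
import torch.nn as nn

# Placeholder sizes, just to illustrate the check.
emb = nn.Embedding(num_embeddings=1000, embedding_dim=200)
idx = torch.randint(0, 1200, (128,))  # deliberately contains out-of-range indices

print("index range:", idx.min().item(), "-", idx.max().item(),
      "| num_embeddings:", emb.num_embeddings)
print("out-of-range indices:", idx[idx >= emb.num_embeddings][:10])

Running the check on the CPU (or with CUDA_LAUNCH_BLOCKING=1) makes the stack trace point at the exact failing line.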


Hi,
I’m facing a similar problem and was wondering if someone could help.
The runtime error says RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle):

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):

  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "", line 56, in forward
    x_scores = self.x_head(input_embeds)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/gems/pytorch-env_py3.7/lib/python3.7/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

You are most likely running out of memory, so you would need to reduce the memory usage, e.g. by decreasing the batch size.