RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Hi, I have a similar issue and would be grateful for any help. Everything works well on the CPU.
GPU: GeForce MX250.

My traceback:

Traceback (most recent call last):
  File "C:\Kamil\readTheImage.py", line 282, in <module>
  File "C:\Kamil\readTheImage.py", line 169, in ocrPlusWarunki
  File "C:\Kamil\readTheImage.py", line 108, in readImage
  File "easyocr\easyocr.py", line 368, in readtext
    result = self.recognize(img_cv_grey, horizontal_list, free_list,\
  File "easyocr\easyocr.py", line 324, in recognize
    result = get_text(self.character, imgH, int(max_width), self.recognizer, self.converter, image_list,\
  File "easyocr\recognition.py", line 189, in get_text
    result1 = recognizer_predict(recognizer, converter, test_loader,batch_max_length,\
  File "easyocr\recognition.py", line 108, in recognizer_predict
    preds = model(image, text_for_pred)
  File "torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "torch\nn\parallel\data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Kamil\Downloads\Project\readTheImage\readTheImage\easyocr\model\model.py", line 30, in forward
    contextual_feature = self.SequenceModeling(visual_feature)
  File "torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "torch\nn\modules\container.py", line 117, in forward
    input = module(input)
  File "torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "easyocr\model\modules.py", line 89, in forward
    output = self.linear(recurrent)  # batch_size x T x output_size
  File "torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "torch\nn\functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

line 169:

table = readImage('.\\temp123\\table'+ str(n)+'.png')

lines 107-109:

def readImage(imgAddress):
    result = reader.readtext(imgAddress, detail=0)
    return result

This error might be raised if you are running out of memory. Did you check whether that is the case?

I’m not sure whether that is the case. The script takes almost all of the GPU memory, if I’m reading it correctly.
Sometimes the script just freezes after a while or crashes with the error above (I have to restart the PC when it crashes). There is no problem when using only the CPU. How can I deal with that?

When the app crashes repeatedly, 2-3 s after starting the script:

When the app is running (no errors for some time):

You could reduce the batch size and check which maximum size doesn’t raise this issue.
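A minimal sketch of that search, assuming a hypothetical callable `run_batch(batch_size)` that performs one forward pass and raises `RuntimeError` when the GPU cannot serve it (PyTorch reports CUDA out-of-memory and cuBLAS failures as `RuntimeError`). The doubling strategy and all names are illustrative, and `fake_run_batch` only simulates a failure so the sketch runs without a GPU:

```python
def largest_working_batch_size(run_batch, start=1, limit=1024):
    """Double the batch size until `run_batch` raises RuntimeError.

    Returns the largest size that succeeded, or 0 if even `start` failed.
    """
    best = 0
    size = start
    while size <= limit:
        try:
            run_batch(size)
        except RuntimeError:
            break  # this size failed; stop probing larger ones
        best = size
        size *= 2
    return best


# Stand-in for a real forward pass: pretend anything above 48 samples fails.
def fake_run_batch(batch_size):
    if batch_size > 48:
        raise RuntimeError("CUDA out of memory (simulated)")


print(largest_working_batch_size(fake_run_batch))  # prints 32
```

With a real model, some CUDA errors leave the process in a bad state, so it may be necessary to restart Python between probes rather than continue in the same session.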

Could you say where I can change the batch size? I don’t have it in my code.

Since you are using nn.Linear layers in your model, the input shape to these layers should be [batch_size, features] and you could try to reduce the size of dim0.

Do you mean this?

API Documentation of EasyOCR
readtext method
batch_size (int, default = 1) - batch_size>1 will make EasyOCR faster but use more memory

Unfortunately, I can’t set it to less than 1.

Your GPU seems to have 2GB of memory, which might then not be enough for the currently used model.
You could try to use e.g. Google Colab and use the free GPUs with more memory.

But the script only reads small images (just 65x25) one by one. How could 2GB of memory not be enough?

Images like this:
examples

The GPU memory is needed for the CUDA context (which contains the runtime, kernels, etc.), the inputs, the model parameters, the intermediate forward activations (during training), the gradients, and the optimizer running estimates (if you are using such an optimizer). So while the image size by itself might fit, the overall training might not.
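To put rough numbers on the list above, here is a back-of-the-envelope sketch. It only counts weights, gradients, and optimizer state, so it is a lower bound: the CUDA context, the inputs, and the forward activations come on top. The function name, the 100-million-parameter figure, and the assumption of two running estimates per parameter (as Adam-style optimizers keep) are illustrative and not taken from any model in this thread:

```python
def estimate_training_bytes(num_params, bytes_per_value=4, optimizer_slots=2):
    """Rough lower bound: weights + gradients + optimizer state.

    Excludes the CUDA context, the inputs, and the forward activations.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = num_params * bytes_per_value * optimizer_slots
    return weights + gradients + optimizer_state


# Illustrative figure only: 100 million float32 parameters, two running
# estimates per parameter in the optimizer.
total = estimate_training_bytes(100_000_000)
print(f"{total / 1024**3:.2f} GiB")
```

For pure inference the gradient and optimizer terms drop out, but the CUDA context and the activations still need space.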

You could run a quick check with a tiny model (e.g. two linear layers) and check what the max. size would be.
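A sketch of such a check, assuming PyTorch is installed. The layer sizes and the function name `tiny_model_fits` are made up for illustration, and the function falls back to the CPU when no GPU is visible:

```python
def tiny_model_fits(batch_size, features=1024, device=None):
    """Run one forward pass through two linear layers; True if it succeeds."""
    import torch  # imported lazily so the function can be defined anywhere
    from torch import nn

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        model = nn.Sequential(
            nn.Linear(features, features),
            nn.ReLU(),
            nn.Linear(features, 10),
        ).to(device)
        x = torch.randn(batch_size, features, device=device)
        with torch.no_grad():  # inference only, no gradients kept
            model(x)
        if str(device).startswith("cuda"):
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        print(f"batch_size={batch_size} failed: {err}")
        return False
    return True
```

Calling `tiny_model_fits(1)`, then `tiny_model_fits(2)`, and so on gives a feel for how much headroom is left. If even `batch_size=1` fails on the GPU, the problem is unlikely to be the size of the real model.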

Hi, I’m getting the same error.
I’m using GPT2ForSequenceClassification from the transformers library by Hugging Face.

import torch
print(torch.__version__)

# out
1.7.1+cu101
import transformers 
print(transformers.__version__)

#out
4.3.3

! nvidia-smi
#out
NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2

I tried running this on Colab as well as on another machine with more RAM.
This is the stack trace:

RuntimeError                              Traceback (most recent call last)

<ipython-input-19-3435b262f1ae> in <module>()
----> 1 trainer.train()

12 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    938                         tr_loss += self.training_step(model, inputs)
    939                 else:
--> 940                     tr_loss += self.training_step(model, inputs)
    941                 self._total_flos += self.floating_point_ops(inputs)
    942 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
   1302                 loss = self.compute_loss(model, inputs)
   1303         else:
-> 1304             loss = self.compute_loss(model, inputs)
   1305 
   1306         if self.args.n_gpu > 1:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1332         else:
   1333             labels = None
-> 1334         outputs = model(**inputs)
   1335         # Save past state if it exists
   1336         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1206             output_attentions=output_attentions,
   1207             output_hidden_states=output_hidden_states,
-> 1208             return_dict=return_dict,
   1209         )
   1210         hidden_states = transformer_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    753                     encoder_attention_mask=encoder_attention_mask,
    754                     use_cache=use_cache,
--> 755                     output_attentions=output_attentions,
    756                 )
    757 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    293             head_mask=head_mask,
    294             use_cache=use_cache,
--> 295             output_attentions=output_attentions,
    296         )
    297         attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    223             attention_mask = encoder_attention_mask
    224         else:
--> 225             query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
    226 
    227         query = self.split_heads(query)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in forward(self, x)
   1204     def forward(self, x):
   1205         size_out = x.size()[:-1] + (self.nf,)
-> 1206         x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
   1207         x = x.view(*size_out)
   1208         return x

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

 x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)

From the stack trace, this is the line that’s causing the error. It belongs to the following class:

class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

    Basically works like a linear layer but the weights are transposed.

    Args:
        nf (:obj:`int`): The number of output features.
        nx (:obj:`int`): The number of input features.
    """

    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf
        w = torch.empty(nx, nf)
        nn.init.normal_(w, std=0.02)
        self.weight = nn.Parameter(w)
        self.bias = nn.Parameter(torch.zeros(nf))

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        x = x.view(*size_out)
        return x

I tried creating a random tensor and passing it to the Conv1D class but that ran fine. Not sure if that helps in narrowing down where the problem is:

import torch

nx = 768
n_state = nx
conv = Conv1D(3 * n_state, nx)  # the Conv1D class shown above
hidden_states = torch.randn([16, 1024, 768])  # no device given, so this tensor lives on the CPU
conv(hidden_states)  # This runs fine.
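Note that, as posted, the snippet above creates the module and the input without a device argument, so unless a CUDA default was set elsewhere it runs on the CPU and never touches cuBLAS. A minimal sketch that repeats the same `torch.addmm` call on the GPU when one is visible (the function name `addmm_like_conv1d` is illustrative and not part of transformers):

```python
def addmm_like_conv1d(batch=16, seq_len=1024, nx=768, device=None):
    """Repeat the matrix multiply from Conv1D.forward on `device`."""
    import torch  # imported lazily so the helper can be defined anywhere

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    nf = 3 * nx
    weight = torch.randn(nx, nf, device=device) * 0.02
    bias = torch.zeros(nf, device=device)
    x = torch.randn(batch, seq_len, nx, device=device)
    # Same call as in Conv1D.forward: bias + x_flat @ weight
    out = torch.addmm(bias, x.view(-1, x.size(-1)), weight)
    if str(device).startswith("cuda"):
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    return tuple(out.view(batch, seq_len, nf).shape), device
```

If this call succeeds on the GPU with the same sizes, the layer itself is probably fine and the failure depends on something else in the training run.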

Could you make sure that you are not running out of memory and cublas is failing to allocate some internal memory by e.g. lowering the batch size?
If that doesn’t help, could you post your setup via python -m torch.utils.collect_env as well as an executable code snippet to reproduce this issue, please?

Hi @ptrblck,
Thank you for the quick response!

1. To check for OOM I tried:
   1.1) Reducing the batch_size to 1 (got the same error).
   1.2) Reducing the number of model parameters by using a smaller pretrained model with batch_size equal to 1 (got the same error).

2. I’ve been mostly experimenting on Colab. Would a link to the notebook work?

I’ve run into a similar issue, but I’m out of ideas (on AWS with a g4dn.2xlarge instance). Almost identical code that I had seemed to work fine.

I also tried to run with a batch size of 1, but it still seems to fail. PS: This code works completely fine when not using a GPU.

FIX: For some reason this was an issue with PyTorch 1.8.0. I looked at the post "RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` while running fine on the CPU" (#13 by ptrblck), tried downgrading PyTorch, and it worked fine.

Hi @ptrblck,
I’d like to take a shot at debugging this issue myself.
Would you mind providing some guidance on how to confirm whether this is an issue with PyTorch itself, or on which part of PyTorch I should look at first to figure out the cause?

I came across the same problem 2 days ago. Using PyTorch 1.8.0 on my machine caused the same error, while using it on another machine works fine. On my machine I was using the pre-compiled version of PyTorch (via pip); on the other machine I compiled PyTorch myself with CUDA 11.1.
I don’t know why the error occurs, but I solved it by downgrading torch to 1.7.0.
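Since these reports hinge on which PyTorch build is installed, it can help to split a version string such as the `1.7.1+cu101` shown earlier in the thread into its release number and CUDA tag before comparing machines. A small sketch; the helper name is hypothetical, and it parses a plain string, so pass it `torch.__version__` on a real setup:

```python
import re


def split_torch_version(version):
    """Split '1.7.1+cu101' into ((1, 7, 1), 'cu101'); the tag is None if absent."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version string: {version!r}")
    release = tuple(int(group) for group in match.groups())
    _, _, local = version.partition("+")
    return release, (local or None)


print(split_torch_version("1.7.1+cu101"))  # ((1, 7, 1), 'cu101')
print(split_torch_version("1.8.0")[0] == (1, 8, 0))  # True
```

The tag after the `+` only describes the CUDA runtime the binary was built with. It says nothing about which GPU architectures the binary contains.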

If you are using a Turing GPU, try out the nightly binary, which should fix the missing sm_75 issue as described here and here.
CC @chatuur
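To see whether a given binary lists the architecture of the installed GPU, something along these lines may help. It is a sketch under assumptions: `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` exist in recent PyTorch releases but may be missing in older ones, the helper names are made up, and the architecture list in the last line is a hypothetical example rather than the contents of any real binary. An exact-match test is also a simplification, since a GPU can sometimes run code built for a nearby architecture.

```python
def capability_to_arch(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def binary_lists_arch(capability, arch_list):
    """True if the exact architecture tag appears in the binary's arch list."""
    return capability_to_arch(capability) in arch_list


def report(device_index=0):
    """Print whether the installed PyTorch binary lists this GPU's architecture."""
    import torch  # imported lazily; needs a CUDA-enabled build and a visible GPU

    capability = torch.cuda.get_device_capability(device_index)
    arch_list = torch.cuda.get_arch_list()
    print(torch.cuda.get_device_name(device_index), capability_to_arch(capability))
    print("architectures in this binary:", arch_list)
    print("exact match:", binary_lists_arch(capability, arch_list))


# Turing GPUs report compute capability (7, 5), i.e. sm_75.
print(binary_lists_arch((7, 5), ["sm_37", "sm_50", "sm_60", "sm_70"]))  # False
```

On a machine with a GPU, calling `report()` prints the device name, its architecture tag, and the list compiled into the binary.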

(Translated from Chinese.) Hello, I also ran into the same problem during testing. I get the correct result when I use the CPU, but when I use the GPU the error message below is reported. Is it a CUDA version problem? Thank you!

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

You might be hitting the previously mentioned error. Did you check the posts above and try installing the nightly?

PS: could you use an online translator before posting the message, please? :)

Hello,

I am facing exactly the same error while trying to run the code on 2 x NVIDIA Tesla K40 using pytorch’s DataParallel().

My setup is: pytorch 1.7.0, cuda 10.1, python 3.7.6

The same code runs on 1 GPU. I also tried setting CUDA_LAUNCH_BLOCKING=1, but then the code hangs completely.

Thank you!
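For reference, a sketch of how these settings can be applied from inside a script. `CUDA_LAUNCH_BLOCKING` and `CUDA_VISIBLE_DEVICES` are standard CUDA environment variables; restricting the process to GPU 0 is only a suggestion for comparing against the working single-GPU run, not something the poster reported doing:

```python
import os

# Set these before the first CUDA call; the safe place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # run kernels synchronously for clearer stack traces
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process

for name in ("CUDA_LAUNCH_BLOCKING", "CUDA_VISIBLE_DEVICES"):
    print(name, "=", os.environ[name])
```

Synchronous launches slow everything down, so this is meant for debugging runs only.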