RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I’m using BertForSequenceClassification from Hugging Face for multi-class classification over 50 classes.

When I try to train my model, I get the runtime error precisely at the line indicated below:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 50, 
    output_attentions = False, 
    output_hidden_states = False, 
)

model.to(device)  # device (e.g. torch.device("cuda")) is defined earlier in the script

for step, batch in enumerate(train_dataloader):

        b_texts = batch[0].to(device)
        b_attention_masks = batch[1].to(device)
        b_authors = batch[2].to(device)

        model.zero_grad()        

        outputs = model(b_texts, 
                        token_type_ids=None, 
                        attention_mask=b_attention_masks, 
                        labels=b_authors)  <---- ERROR HERE

Could you run your code with:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here, please?
Also, does your code run on the CPU without any errors?
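
For reference, a quick CPU smoke test could look like this (just a sketch that reuses the variable names from the snippet above and runs a single batch):

import torch

# Run one forward pass on the CPU to see whether the error is CUDA-specific.
device = torch.device("cpu")
model.to(device)

batch = next(iter(train_dataloader))
b_texts, b_attention_masks, b_authors = [t.to(device) for t in batch]

outputs = model(b_texts,
                token_type_ids=None,
                attention_mask=b_attention_masks,
                labels=b_authors)
print(outputs[0])  # the loss (older transformers versions return a plain tuple)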

Noob question, but how do I post the stack trace from a Jupyter notebook? I set the env with %env CUDA_LAUNCH_BLOCKING=1 and ran the cell, but didn’t get anything that resembled a stack trace.

I’m not sure, as I don’t use Jupyter notebooks and have often seen the kernel restart before the stack trace is printed. I would recommend running the script in a terminal, which will print the stack trace.
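
If you do want to stay inside the notebook, one option (an assumption on my side, not verified in your setup) is to set the variable before torch is imported, since it only takes effect when the CUDA context is created:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA work happens

import torch  # import torch (and transformers) only after setting the variable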

Here’s what it outputs:

Traceback (most recent call last):
  File "scratch2.py", line 193, in <module>
    labels=b_authors)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/modeling_bert.py", line 1176, in forward
    inputs_embeds=inputs_embeds,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/modeling_bert.py", line 783, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/modeling_bert.py", line 174, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /Users/administrator/nightlies/pytorch-1.0.0/wheel_build_dirs/wheel_3.7/pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

Your embedding layer is getting invalid indices. Make sure you are passing inputs with values in the range [0, num_embeddings - 1].
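
Since your traceback points at self.position_embeddings, sequences longer than the model’s max_position_embeddings (512 for bert-base-uncased) would also trigger this. A quick sanity check, assuming the variable names from your snippet, could be:

vocab_size = model.config.vocab_size            # 30522 for bert-base-uncased
max_len = model.config.max_position_embeddings  # 512 for bert-base-uncased

for step, batch in enumerate(train_dataloader):
    b_texts = batch[0]
    # word embedding indices must be valid token ids
    assert b_texts.min() >= 0 and b_texts.max() < vocab_size, \
        f"step {step}: token id out of range (max={b_texts.max().item()})"
    # position embedding indices must not exceed the maximum sequence length
    assert b_texts.size(1) <= max_len, \
        f"step {step}: sequence length {b_texts.size(1)} > {max_len}"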

Hi,
I have a somewhat similar problem and was wondering if you could help me.
I had the same runtime error, ran my code with CUDA_LAUNCH_BLOCKING=1, and here is the output:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=110 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "MUSK2.py", line 215, in <module>
    loss = train(epoch, train_loader, model, criterion, optimizer)
  File "MUSK2.py", line 158, in train
    loss = criterion(output, target)
  File "/local-scratch/localhome/msaberia/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/local-scratch/localhome/msaberia/pytorch/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/local-scratch/localhome/msaberia/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 2021, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/local-scratch/localhome/msaberia/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110

The error points to an out of bounds index for nn.NLLLoss (or nn.CrossEntropyLoss):

Assertion `t >= 0 && t < n_classes` failed.

Make sure to pass the target with values in the range [0, nb_classes-1].
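
A minimal check along those lines (a sketch, using the output and target names from your train function) could be:

n_classes = output.size(1)  # number of output units of the model
assert target.min() >= 0 and target.max() < n_classes, \
    f"targets must lie in [0, {n_classes - 1}], got min={target.min().item()}, max={target.max().item()}"
# if your labels start at 1 or are arbitrary ids, remap them to 0..n_classes-1 first
loss = criterion(output, target)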

That was it!
Got it, thanks! 🙂

Hi,
I have a somewhat similar problem and was wondering if you could help me.
I had the same runtime error, ran my code with CUDA_LAUNCH_BLOCKING=1, and here is the output:

Traceback (most recent call last):
  File "MINE_main.py", line 62, in <module>
    result = util.train((x, y), mine_net, mine_net_optim, checkpoint_template, iter_num = iter_num, save_model=False, verbose = True)
  File "/nas/home/jiazli/Bias_assessment/model/MINE/util.py", line 77, in train
    mi_lb, ma_et = learn_mine(batch, mine_net, mine_net_optim, ma_et)
  File "/nas/home/jiazli/Bias_assessment/model/MINE/util.py", line 32, in learn_mine
    mi_lb, t, et = mutual_information(joint, marginal, mine_net)
  File "/nas/home/jiazli/Bias_assessment/model/MINE/util.py", line 18, in mutual_information
    t = mine_net(joint)
  File "/nas/home/jiazli/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/nas/home/jiazli/Bias_assessment/model/MINE/MINE.py", line 19, in forward
    output = F.elu(self.fc1(input))
  File "/nas/home/jiazli/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/nas/home/jiazli/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/nas/home/jiazli/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Could you post an executable code snippet to reproduce this issue, as well as your PyTorch and CUDA versions and the GPU you are using?

Thank you very much. After I updated to the latest version, the problem was solved.

Hello,

I have also run into the same problem recently and am not sure how to tackle it. I have already set CUDA_LAUNCH_BLOCKING=1, but I still receive the same error. It seems like the number of labels and the number of output units might not be equal, but I’m not sure how to verify this.

Could you please take a look at the error below? I would really appreciate any suggestions. I’m quite new to PyTorch and deep learning, so please bear with me.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed eval> in <module>

~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
    761                     continue
    762 
--> 763                 tr_loss += self.training_step(model, inputs)
    764                 self.total_flos += self.floating_point_ops(inputs)
    765 

~/.local/lib/python3.6/site-packages/transformers/trainer.py in training_step(self, model, inputs)
   1111                 loss = self.compute_loss(model, inputs)
   1112         else:
-> 1113             loss = self.compute_loss(model, inputs)
   1114 
   1115         if self.args.n_gpu > 1:

~/.local/lib/python3.6/site-packages/transformers/trainer.py in compute_loss(self, model, inputs)
   1135         Subclass and override for custom behavior.
   1136         """
-> 1137         outputs = model(**inputs)
   1138         # Save past state if it exists
   1139         if self.args.past_index >= 0:

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    159             return self.module(*inputs[0], **kwargs[0])
    160         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 161         outputs = self.parallel_apply(replicas, inputs, kwargs)
    162         return self.gather(outputs, self.output_device)
    163 

~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    169 
    170     def parallel_apply(self, replicas, inputs, kwargs):
--> 171         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    172 
    173     def gather(self, outputs, output_device):

~/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     84         output = results[i]
     85         if isinstance(output, ExceptionWrapper):
---> 86             output.reraise()
     87         outputs.append(output)
     88     return outputs

~/.local/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/tlqn/.local/lib/python3.6/site-packages/transformers/modeling_albert.py", line 796, in forward
    return_dict=return_dict,
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/tlqn/.local/lib/python3.6/site-packages/transformers/modeling_albert.py", line 690, in forward
    return_dict=return_dict,
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/tlqn/.local/lib/python3.6/site-packages/transformers/modeling_albert.py", line 421, in forward
    hidden_states = self.embedding_hidden_mapping_in(hidden_states)
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tlqn/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

The error might also be raised if you are running out of memory on the GPU and cuBLAS is unable to create the handle.
Could you reduce the batch size and check if the code is working?
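
For reference, this is roughly how you could check how much memory PyTorch itself is holding on the device (nvidia-smi additionally shows usage by other processes):

import torch

dev = torch.device("cuda:0")
total = torch.cuda.get_device_properties(dev).total_memory
allocated = torch.cuda.memory_allocated(dev)
reserved = torch.cuda.memory_reserved(dev)
print(f"total     : {total / 1024**3:.2f} GiB")
print(f"allocated : {allocated / 1024**3:.2f} GiB")
print(f"reserved  : {reserved / 1024**3:.2f} GiB")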

More than 50% of the GPU memory is free at the moment. I restarted the kernel and even reduced the batch size to 1, but I still received the same error as above, or this one: RuntimeError: CUDA error: device-side assert triggered

Thanks for the update. Could you post an executable code snippet to reproduce this issue, as well as the system information (PyTorch, CUDA, and cuDNN versions and the GPUs used)?
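
In the meantime, you could also verify the label/output-unit mismatch you suspected. A rough sketch (how the labels are stored in train_dataset is an assumption about your setup, so adapt the access accordingly):

num_labels = model.config.num_labels  # output units of the classification head
all_labels = [example["labels"] for example in train_dataset]  # adjust to how your dataset stores labels

print("num_labels in the model:", num_labels)
print("label range in the data:", min(all_labels), "to", max(all_labels))
assert min(all_labels) >= 0 and max(all_labels) < num_labels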

I’m getting the same problem trying to run BERT. Any solution, please?

4 GPUs
PyTorch: 1.6.0
CUDA: 10.2

The code I’m trying to run:

Are you seeing the same issue after upgrading PyTorch to the latest stable release (1.7.1) and/or reducing the batch size?

It displays the same error. Just to mention, the code works well on the CPU.

Thanks for the update. Which GPUs and which CUDA, cuDNN, NVIDIA driver, and PyTorch versions are you using?
Also, which dataset and script from the repository are you calling?
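
For completeness, most of this information can be printed directly from PyTorch (the driver version is shown by nvidia-smi):

import torch

print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)
print("cuDNN  :", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))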