CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

I’m trying to use torch.mm(), but I sometimes get this error:

CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

I run the code with PyTorch 1.4 and CUDA 9.0.
I have also run the code with PyTorch 1.5 and CUDA 9.0, and I sometimes get this bug there as well.
More details in #180

This error might be raised if you are running out of memory and cuBLAS fails to create the handle, so try to reduce the memory usage, e.g. via a smaller batch size.
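
One quick way to confirm the out-of-memory hypothesis is to print the allocator state right before the failing call. This is only a sketch; torch.cuda.mem_get_info needs a newer PyTorch release than the 1.4/1.5 mentioned above, so drop that line (or check nvidia-smi) on older versions.

import torch

# Allocator state on the current CUDA device, printed just before the failing op.
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")

# Free/total device memory as seen by the driver (newer PyTorch releases only).
free, total = torch.cuda.mem_get_info()
print(free / 1024**2, "MiB free of", total / 1024**2, "MiB total")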


Hi, I get the same error.
Everything works fine when I use only one GPU, but if I use two, I get this error. I also reduced the size of my problem, so it runs and finishes on one GPU quickly.

File “/home/server/Escritorio/AlejandroF/agfa/simple-transformers/train_language_model.py”, line 138, in
model.train_model(train_filename)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/simpletransformers/language_modeling/language_modeling_model.py”, line 431, in train_model
global_step, training_details = self.train(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/simpletransformers/language_modeling/language_modeling_model.py”, line 774, in train
model(inputs, labels=labels)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py”, line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py”, line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py”, line 86, in parallel_apply
output.reraise()
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/_utils.py”, line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py”, line 61, in _worker
output = module(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 1329, in forward
outputs = self.bert(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 991, in forward
encoder_outputs = self.encoder(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 582, in forward
layer_outputs = layer_module(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 470, in forward
self_attention_outputs = self.attention(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 401, in forward
self_outputs = self.self(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 267, in forward
mixed_query_layer = self.query(hidden_states)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/linear.py”, line 93, in forward
return F.linear(input, self.weight, self.bias)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/functional.py”, line 1692, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)


nn.DataParallel (which seems to be used in your use case) can create an imbalanced memory usage and could thus cause an OOM on the default device, which is why we recommend using DistributedDataParallel with a single process per GPU instead.
I assume you are also running OOM, so I would recommend trying DDP.
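
For reference, a minimal sketch of that recommendation (single node, one process per GPU, launched via torchrun --nproc_per_node=<num_gpus> train.py); the model, dataset, and hyperparameters are made-up placeholders, not code from this thread.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; replace with your own module and dataset.
    model = DDP(nn.Linear(10, 2).cuda(local_rank), device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    sampler = DistributedSampler(dataset)        # each rank sees its own shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        criterion(model(x), y).backward()        # DDP all-reduces the gradients
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()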

Thanks for your answer.

Hi, thanks for your reply. Is there any way I can save memory? I have tried torch.einsum, but it does not seem to save any memory. Any idea?

Here is a simple example:

# old version
inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
# new version
inter_matrix = torch.einsum('ik, kj -> ij', flatten_masks, flatten_masks.transpose(1, 0))

I’m not sure what your use case is. Would you like to lower the memory usage on your GPU?

Yep, here is a script that I use to check the GPU memory usage and running time:

import torch
import numpy as np
import time

# 800 x 60800 matrix (float64, since np.random.random returns float64)
flatten_masks = np.random.random((800, 60800))
flatten_masks = torch.from_numpy(flatten_masks).cuda(device=0)

t1 = time.time()
i = 0
while i < 2500:
    if i == 500:
        # restart the timer after 500 warm-up iterations
        torch.cuda.synchronize()
        t1 = time.time()
    # old version
    inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
    # new version
    # inter_matrix = torch.einsum('ik, kj -> ij', flatten_masks, flatten_masks.transpose(1, 0))
    i += 1
# wait for the queued kernels to finish before stopping the timer
torch.cuda.synchronize()
t2 = time.time()
print(t2 - t1)

result:

                  torch.mm    torch.einsum
process time (s)  4.69856     4.5713
GPU memory (MiB)  807         807

My question is: what code can I use to save memory?
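
Two generic ways to lower peak memory for this pattern, assuming reduced precision is acceptable for the masks: cast the float64 operand to float32 (in this example the 800 x 60800 input dominates the memory, so this usually saves the most), and/or compute the result in row blocks so each cuBLAS call works on a smaller problem. A sketch:

import numpy as np
import torch

flatten_masks = torch.from_numpy(np.random.random((800, 60800))).cuda(device=0)

# Option 1: float64 -> float32 halves the memory of the large operand.
flatten_masks = flatten_masks.float()

# Option 2: build the 800 x 800 result in row blocks so each matmul
# (and its cuBLAS workspace) is smaller.
def chunked_mm(mat, chunk=200):
    return torch.cat([mat[i:i + chunk] @ mat.t() for i in range(0, mat.size(0), chunk)])

inter_matrix = chunked_mm(flatten_masks)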

Hello (continuing on this thread), can you provide me with some guidance?
I am facing a similar error when I try to call my encoder (of the Transformer block). Error: …
→ 4765 proj = linear(q, w, b)
4766 # reshape to 3, E and not E, 3 is deliberate for better memory coalescing and keeping same order as chunk()
4767 proj = proj.unflatten(-1, (3, E)).unsqueeze(0).transpose(0, -2).squeeze(-2).contiguous()

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

src = self.encoder(
    src=src
).to(device)
################################################
encoder_layer = nn.TransformerEncoderLayer(
    d_model=dim_val,
    nhead=n_heads,
    dim_feedforward=dim_feedforward_encoder,
    dropout=dropout_encoder,
    batch_first=batch_first
).to(device)

# Stack the encoder layers in nn.TransformerEncoder
self.encoder = nn.TransformerEncoder(encoder_layer=encoder_layer, num_layers=n_encoder_layers, norm=True).to(device)

Output:
cuda 1
From model.forward(): Size of src as given to forward(): torch.Size([10, 20, 4])
From model.forward(): tgt size = torch.Size([10, 20, 2])
From model.forward(): Size of src after input layer: torch.Size([10, 20, 2204])
step x shape: torch.Size([10, 20, 2204])
Pe_map shape: torch.Size([10, 1, 2204])
From model.forward(): Size of src after pos_enc layer: torch.Size([10, 20, 2204])


--------------------------------------------------------------------------- RuntimeError Traceback (most recent call last)…

Why does this error occur, and how do I go about handling it?

You are also most likely running out of memory, which causes the cuBLAS handle init error, as it tries to allocate a workspace. I would recommend reducing the memory usage, e.g. by lowering the batch size, and checking if this fixes the error.
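
If the batch itself is what no longer fits, one generic option (not specific to this model) is gradient accumulation: run smaller micro-batches and accumulate gradients, so the effective batch size stays the same while peak activation memory drops. A minimal sketch with placeholder model and data:

import torch
import torch.nn as nn

# Placeholder model and data; swap in your own module and batches.
model = nn.Linear(128, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(64, 128, device="cuda")
targets = torch.randint(0, 10, (64,), device="cuda")

micro_bs = 8                                    # smaller per-step batch to cut peak memory
num_chunks = inputs.size(0) // micro_bs
optimizer.zero_grad()
for x, y in zip(inputs.split(micro_bs), targets.split(micro_bs)):
    loss = criterion(model(x), y) / num_chunks  # scale so the summed gradient matches the full batch
    loss.backward()                             # gradients accumulate across micro-batches
optimizer.step()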


Hi! I am exploring the Accelerate library from Hugging Face to fine-tune BertForSequenceClassification. Relevant code snippets:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from accelerate import Accelerator

I created the Dataset class and the DataLoaders:
def get_dataloaders():
    train_dl = DataLoader(train_ds, shuffle=True, pin_memory=True)
    val_dl = DataLoader(val_ds, shuffle=False, pin_memory=True)
    
    return train_dl, val_dl

def training_loop():
    filename = "models/bert_category_clsfr"
    
    accelerator = Accelerator()
    device = accelerator.device

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    optimizer = torch.optim.AdamW(model.parameters())
    train_dl, val_dl = get_dataloaders()
    
    model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)
    val_dl = accelerator.prepare(val_dl)
    
    for epoch in range(1):
        model.train()
        for step,batch in enumerate(train_dl):
            input_ids = batch["input_ids"].to(device)
            targets = batch["label"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            optimizer.zero_grad()

            output = model(input_ids, attention_mask)
            loss = F.cross_entropy(output.logits, targets)

            accelerator.backward(loss)
            optimizer.step()

    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    accelerator.save(unwrapped_model.state_dict(), filename)

from accelerate import notebook_launcher
notebook_launcher(training_loop, num_processes=4)

Results in the same error as above:
ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File “/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/multiprocessing/spawn.py”, line 69, in _wrap
fn(i, *args)
File “/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/accelerate/utils/launch.py”, line 509, in __call__
self.launcher(*args)
File “/tmp/ipykernel_73883/403198274.py”, line 25, in training_loop
accelerator.backward(loss)
File “/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/accelerate/accelerator.py”, line 1683, in backward
loss.backward(**kwargs)
File “/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/_tensor.py”, line 488, in backward
torch.autograd.backward(
File “/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/autograd/__init__.py”, line 197, in backward
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

The batch size is 4 for 4 GPUs, so I don’t think the batch size is an issue.
Please share what else can be explored to use Accelerate for distributed training.

Could you explain why not and how you’ve verified this batch size should fit into your GPUs?

As per my understanding, a batch size of 4 on 4 processes means that a batch of 4 data points is executed on a single GPU.
If I use the script without Accelerate and with the Hugging Face Trainer API, I can run it successfully with a batch size of 16, hence I thought a batch size of 4 should not be a problem. In case there is another way to verify that a given batch size fits into the GPU, please let me know; I will be happy to verify it.

Resolved the error. My problem was not specifying num_labels while loading the model.
Changing

BertForSequenceClassification.from_pretrained("bert-base-uncased")

to

BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=NUM_LABELS)

helped resolve the issue. I also added torch.cuda.empty_cache() before starting the fine-tuning process.
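
A wrong num_labels tends to surface as confusing CUDA/cuBLAS errors because the GPU cross-entropy kernel hits a device-side assert on out-of-range class indices, and every CUDA call issued afterwards then fails. A hypothetical minimal repro (running with CUDA_LAUNCH_BLOCKING=1, as in the snippet above, makes the real assertion show up at the offending line):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 2, device="cuda")            # as if the model was built with num_labels=2
targets = torch.tensor([0, 1, 3, 2], device="cuda")  # but the dataset actually has 4 classes
loss = F.cross_entropy(logits, targets)              # device-side assert: t >= 0 && t < n_classes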