RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using the following code to fine-tune Llama-7B with LoRA:

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

ds = Dataset.load_from_disk("../data/alpaca_data_zh/")
tokenizer = AutoTokenizer.from_pretrained("../model/Llama-2-7b-ms")
def process_func(example):
    ... # process data
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)

model = AutoModelForCausalLM.from_pretrained("../model/Llama-2-7b-ms", low_cpu_mem_usage=True, 
                                             torch_dtype=torch.half, device_map="auto")
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
model.enable_input_require_grads()
args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    gradient_checkpointing=True
)
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=tokenized_ds.select(range(6000)),
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()
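
(process_func is elided above; it maps each record to input_ids, attention_mask and labels for causal-LM training. A simplified sketch of such a function, where the prompt template and the Alpaca-style field names are assumptions for illustration, not my exact code:)

def process_func(example, max_length=512):
    # Assumed fields "instruction", "input", "output" -- illustration only.
    prompt = f"Human: {example['instruction']}\n{example['input']}\n\nAssistant: "
    instruction = tokenizer(prompt, add_special_tokens=False)
    response = tokenizer(example["output"] + tokenizer.eos_token, add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]
    # compute the loss only on the response tokens
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }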

Everything works until trainer.train(), which reports the following warnings:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)

I have tried adding /usr/lib/x86_64-linux-gnu, which contains libstdc++.so.6 and libm.so.6, to $LD_LIBRARY_PATH, but the linker still can't find them, and training fails with the following error:

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

I’d appreciate it if someone could give me some advice.

Here are my library versions; please tell me if you need more information:

  • os: ubuntu 22.04
  • pytorch version: 2.1.0
  • cuda: 11.8
  • accelerate: 0.34.2
  • transformers: 4.44.2

The full log is here:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::runtime_error::~runtime_error()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `__gxx_personality_v0@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::ostream::tellp()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::substr(unsigned long, unsigned long) const@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::_M_replace_aux(unsigned long, unsigned long, unsigned long, char)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `dlopen'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `typeinfo for bool@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::__throw_logic_error(char const*)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `VTT for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@GLIBCXX_3.4'
... # similar output
collect2: error: ld returned 1 exit status

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
... # similar output

I found that removing device_map="auto" fixes the issue.
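
For anyone hitting the same error: the only change is in the model-loading call, i.e. dropping device_map and letting the Trainer move the model onto the single GPU itself (a sketch of the working call):

model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,
)  # no device_map="auto" -- Trainer places the model on the GPU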