ValueError: Cannot find backend for cpu in flash_attn/ops/triton/rotary.py

Issue Description

Problem

Executing the provided code results in a ValueError: Cannot find backend for cpu originating from flash_attn/ops/triton/rotary.py.

Context

The code uses the flash_attn library together with a pretrained model for text generation. The error is raised while the rotary position embeddings are being applied.
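
To separate the failure from the modelling code, here is a minimal sketch that should reproduce the same error on CPU tensors. The shapes are illustrative assumptions based on flash-attn 2.2.2's apply_rotary signature (x: batch, seqlen, nheads, headdim; cos/sin: seqlen, rotary_dim // 2), not values taken from this report:

    import torch
    from flash_attn.ops.triton.rotary import apply_rotary

    # All tensors deliberately on the CPU; shapes are illustrative.
    x = torch.randn(1, 8, 4, 64)   # (batch, seqlen, nheads, headdim)
    cos = torch.randn(8, 16)       # (seqlen, rotary_dim // 2)
    sin = torch.randn(8, 16)

    # Expected to fail with "ValueError: Cannot find backend for cpu":
    # the Triton kernel behind apply_rotary can only launch on CUDA tensors.
    out = apply_rotary(x, cos, sin)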

Steps to Reproduce

  1. Import necessary libraries:

    import transformers
    import torch
    import time
    import os
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import PeftModel
    
  2. Define base model name and new model path:

    base_model_name = 'NousResearch/Yarn-Llama-2-7b-128k'
    new_model_path = '/opt/llama_models/testing'
    
  3. Load the base model:

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        trust_remote_code=True
    )
    
  4. Initialize a PeftModel using the base model and a custom path:

    model = PeftModel.from_pretrained(base_model, new_model_path)
    
  5. Merge and unload the model:

    model = model.merge_and_unload()
    
  6. Initialize a tokenizer:

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.padding_side = "right"
    tokenizer.add_eos_token = True
    
  7. Set up a text generation pipeline (see the device-placement check after these steps):

    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto"
    )
    
  8. Provide a prompt and specify generation parameters:

    prompt = "something here".strip()
    max_new_tokens = 1
    
  9. Generate text:

    start_time = time.time()
    sequences = pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens
    )
    elapsed_time = time.time() - start_time
    
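A quick sanity check related to step 7: when an already-instantiated model object is passed to pipeline(), the device_map="auto" argument may not actually move the weights, so they can silently stay on the CPU. This is an assumption about the failure mode, not something confirmed in the report, but it is cheap to verify:

    # Where do the merged model's weights actually live? "cpu" here would
    # explain the error, since flash_attn's Triton kernels require CUDA tensors.
    print(next(model.parameters()).device)

If this prints cpu, moving the model with model.to("cuda") (or loading it onto the GPU up front, as sketched at the end of this issue) before generation is worth trying.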

Expected Behavior

The code should generate text using the specified prompt and parameters without encountering any errors.

Actual Behavior

The code raises a ValueError: Cannot find backend for cpu during the rotary transformation step.

Full Error:

...
/usr/local/lib/python3.8/dist-packages/flash_attn/layers/rotary.py in forward(ctx, x, cos, sin, interleaved, inplace, seqlen_offsets, cu_seqlens, max_seqlen)
     46         max_seqlen: Optional[int] = None,
     47     ):
---> 48         out = apply_rotary(
     49             x,
     50             cos,

/usr/local/lib/python3.8/dist-packages/flash_attn/ops/triton/rotary.py in apply_rotary(x, cos, sin, seqlen_offsets, cu_seqlens, max_seqlen, interleaved, inplace, conjugate)
    211     # ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
    212     with torch.cuda.device(x.device.index):
--> 213         rotary_kernel[grid](
    214             output,  # data ptrs
    215             x,

<string> in rotary_kernel(OUT, X, COS, SIN, CU_SEQLENS, SEQLEN_OFFSETS, seqlen, nheads, rotary_dim, seqlen_ro, CACHE_KEY_SEQLEN, stride_out_batch, stride_out_seqlen, stride_out_nheads, stride_out_headdim, stride_x_batch, stride_x_seqlen, stride_x_nheads, stride_x_headdim, BLOCK_K, IS_SEQLEN_OFFSETS_TENSOR, IS_VARLEN, INTERLEAVED, CONJUGATE, BLOCK_M, grid, num_warps, num_stages, extern_libs, stream, warmup, device, device_type)

ValueError: Cannot find backend for cpu

Environment

  • Python Version: 3.8.10
  • PyTorch Version: 2.0.1+cu118
  • Transformers Version: 4.33.0.dev0
  • flash-attn Version: 2.2.2
  • Triton Version: 2.1.0

Additional Information

  • Before this error I was encountering a different error, which was resolved by upgrading Triton to 2.1.0; immediately after that upgrade I got stuck on the current error.

You are hitting a code path that explicitly targets the GPU; the traceback shows flash_attn entering:

    with torch.cuda.device(x.device.index):

so did you check whether flash_attn supports CPU-only workloads at all?
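
If it does not, one plausible workaround — a sketch only, assuming the real problem is that the model never left the CPU — is to place the model on the GPU at load time, before the PEFT merge and the pipeline construction. torch_dtype=torch.float16 and device_map="auto" (which requires accelerate to be installed) are assumptions here, not settings from the original report:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Load the base model straight onto the GPU so every tensor reaching
    # flash_attn's Triton kernels is a CUDA tensor.
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16,  # assumption: fp16 so a 7B model fits in GPU memory
        device_map="auto",          # assumption: dispatch weights to available GPUs
    )
    model = PeftModel.from_pretrained(base_model, new_model_path)
    model = model.merge_and_unload()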