ValueError: Cannot find backend for cpu in flash_attn/ops/triton/rotary.py

You are explicitly using the GPU via:

with torch.cuda.device

so have you checked whether flash_attn supports CPU-only workloads? The error suggests the Triton kernel was handed tensors that live on the CPU.
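If it helps to narrow this down, here is a minimal sketch of a guard you could call right before the flash_attn code path. It assumes the failure comes from inputs sitting on the CPU, which I have not verified against your setup. `ensure_cuda_inputs` is a hypothetical helper name, not part of flash_attn or PyTorch; it only relies on each tensor exposing `.device.type`, as `torch.Tensor` does.

```python
def ensure_cuda_inputs(**tensors):
    """Raise a readable error if any named tensor is not on a CUDA device.

    Hypothetical helper (not a flash_attn or PyTorch API). Each value only
    needs a `.device.type` attribute, which torch.Tensor provides.
    """
    # Collect every argument whose device is something other than CUDA.
    offenders = {
        name: str(t.device)
        for name, t in tensors.items()
        if t.device.type != "cuda"
    }
    if offenders:
        raise ValueError(
            f"Expected CUDA tensors for the Triton kernel, got: {offenders}"
        )


# Usage (assumes q and k are torch tensors in your own code):
#   ensure_cuda_inputs(q=q, k=k)
#   # if this raises, move the inputs first, e.g. q = q.to("cuda")
```

If the guard raises, the fix is on the caller's side (move the model and inputs to the GPU) rather than inside rotary.py.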