I’m trying to run the FlashAttention variant of F.scaled_dot_product_attention.
Config = namedtuple('FlashAttentionConfig', ['enable_flash', 'enable_math', 'enable_mem_efficient'])
self.cuda_config = Config(True, False, False)
with torch.backends.cuda.sdp_kernel(**self.cuda_config._asdict()):
    x = F.scaled_dot_product_attention(q, k, v)
I am on an A100-SXM and tried running this with:
CUDA 12.0 and PyTorch 2.1.0.dev20230526+cu121
CUDA 11.7 and PyTorch 2.0.1
I can’t find any references to this error, and I’m not sure what I’m doing wrong.
It works just fine with Config(False, True, True), which uses the math and memory-efficient attention backends, but I would prefer to use FlashAttention.
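For reference, the combination that does work for me is just the snippet above with the flags changed:
self.cuda_config = Config(False, True, True)
with torch.backends.cuda.sdp_kernel(**self.cuda_config._asdict()):
    x = F.scaled_dot_product_attention(q, k, v)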
Thank you, it’s working now. It frequently reports ‘reduction over non-contiguous data’, but it seems to be working.
If you don’t mind, I have two follow-up questions:
Does the FlashAttention implementation available in PyTorch only work with float16/bfloat16, like the original repo’s implementation, or would it work with float32 as well? (I put a small sketch of what I mean after the second question.)
Is this fully compatible with torch.compile?
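For the first question, here is a minimal sketch of what I would try, assuming the flash kernel really does need fp16/bf16 inputs (that assumption is exactly what I’m asking about; q, k, v are the tensors from my snippet above):
# Cast the inputs down for the fused kernel, then cast the output back.
q_h, k_h, v_h = (t.to(torch.bfloat16) for t in (q, k, v))
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    x = F.scaled_dot_product_attention(q_h, k_h, v_h).to(q.dtype)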
For those interested, this error no longer occurs after updating to PyTorch nightly version 2.1.0.dev20230527+cu121.
I get the same error with PyTorch 2.3.0+cu121 built from source. What could be the reason?
I’m running on a cluster of Tesla V100-SXM2-32GB GPUs with CUDA 12.4.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    attn_out = F.scaled_dot_product_attention(xq, keys, values)
Based on this code, Volta GPUs are not supported.
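As a quick check, you can inspect the compute capability before trying to force the flash backend. This is just a sketch; the exact minimum architecture is an assumption here, but Volta (sm70) is below it either way:
import torch

# V100 is sm70 (Volta); A100 is sm80 (Ampere).
major, minor = torch.cuda.get_device_capability()
if major < 8:
    print(f"sm{major}{minor}: the flash SDP kernel is likely unavailable on this GPU")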
I would generally not recommend forcing a specific algorithm; instead, let PyTorch select the fastest one for the device you are using.
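For example, calling it without any sdp_kernel context manager lets PyTorch dispatch to whichever fused backend is supported on the current device (the shapes below are placeholders):
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# No backend is forced: PyTorch picks the fastest available implementation.
out = F.scaled_dot_product_attention(q, k, v)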
Yes, it works if I let PyTorch choose the best algorithm. But flash attention alone does not seem to work, as it does not support a separate attention mask.
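In case it helps, a small sketch of what I mean (as far as I understand, passing an explicit attn_mask makes the flash backend ineligible, while is_causal=True keeps it eligible; the tensors are placeholders):
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# An explicit boolean mask (True = attend) typically rules out the flash kernel.
mask = torch.ones(1024, 1024, device="cuda", dtype=torch.bool).tril()
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Expressing causality via is_causal avoids materializing a mask and keeps flash eligible.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)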