Explicitly forcing torch's MHA to use Flash Attention

How can we force torch to use the new SDPA implementation in torch.nn.MultiheadAttention?

I know that it is supposed to use it automatically, but I’d still prefer an explicit version.
There’s also a context manager, but I’m unsure where it’s supposed to be wrapped. Using it inside forward() doesn’t change anything, nor does simply calling:

torch.backends.cuda.enable_flash_sdp(enabled=True)

somewhere in the code.
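For reference, what I’ve been trying looks roughly like this. The sizes and shapes are made up, and I’m assuming the context manager in question is torch.backends.cuda.sdp_kernel, wrapped around the call into the module rather than placed inside forward():

import torch
import torch.nn as nn

device = "cuda"
embed_dim, num_heads = 512, 8  # made-up sizes for illustration

# flash attention only runs on fp16/bf16 inputs on CUDA
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True,
                            device=device, dtype=torch.float16)
x = torch.randn(4, 128, embed_dim, device=device, dtype=torch.float16)

# restrict SDPA to the flash backend for everything inside this block
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    # need_weights=False is required for MultiheadAttention to dispatch
    # to scaled_dot_product_attention at all; asking for the weights
    # falls back to the explicit implementation
    out, _ = mha(x, x, x, need_weights=False)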

Curious why you want to do this, since I’d imagine it could produce incorrect results?

Why would that be? The new SDPA doesn’t really trade off precision AFAIK.

I don’t know the specifics, but I’d imagine the speedups are enabled by making specific assumptions about the inputs, and if those assumptions are violated, correctness may not be guaranteed. If you have a use case that you think should be supported but currently isn’t, maybe you can file an issue?
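One way to check whether the flash kernel is even eligible for your inputs would be to disable the fallbacks and call scaled_dot_product_attention directly. Untested sketch, assuming a recent 2.x build and a CUDA device (the fp32 input here is deliberate, since flash attention needs fp16/bf16):

import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), fp32 on purpose
q = torch.randn(4, 8, 128, 64, device="cuda")

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    try:
        out = F.scaled_dot_product_attention(q, q, q)
    except RuntimeError as e:
        # with the other backends disabled you get an error (plus warnings
        # explaining which constraint ruled the flash kernel out) instead
        # of a silent fallback
        print("flash kernel not eligible:", e)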

Filing an issue doesn’t seem worth it, tbh. Maybe @ptrblck has some idea what I should do?