Scaled Dot Product Attention Algorithm Selection—results don't match documentation/errors

In the documentation, I see there are three scaled dot product attention algorithms.

I am facing an interesting scenario where the kernels for flash and memory-efficient attention aren’t installed, but I get different results when I use

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):

and when I use

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):

Specifically, I compare results with and without torch.use_deterministic_algorithms() set, with all other sources of randomness accounted for. The difference is zero in the first instance (flash and memory-efficient disabled) and nonzero in the second (all attention algorithms enabled). But as far as I know, there is only one possible implementation on my system. Is there any explanation for this?
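For anyone who wants to reproduce this kind of comparison, here is a minimal sketch (not the code from my project). It runs forward and backward twice under each backend setting and reports the largest run-to-run difference in the gradient. The helper names grad_of_attention and run_to_run_diff, the tensor shapes, and the choice of loss are all assumptions made for illustration. Newer PyTorch releases emit a deprecation warning for this context manager.

```python
import torch
import torch.nn.functional as F


def grad_of_attention(q, k, v, **backend_flags):
    """Return dLoss/dq for one forward/backward pass under the given backend flags."""
    # Fresh leaf tensors so gradients from earlier calls do not accumulate.
    q, k, v = (t.detach().clone().requires_grad_(True) for t in (q, k, v))
    with torch.backends.cuda.sdp_kernel(**backend_flags):
        out = F.scaled_dot_product_attention(q, k, v)
    out.sum().backward()  # illustrative loss, chosen only to obtain a gradient
    return q.grad


def run_to_run_diff(q, k, v, **backend_flags):
    """Largest absolute difference in dLoss/dq between two identical runs."""
    first = grad_of_attention(q, k, v, **backend_flags)
    second = grad_of_attention(q, k, v, **backend_flags)
    return (first - second).abs().max().item()


device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision on GPU so the fused kernels are eligible; float32 on CPU.
dtype = torch.float16 if device == "cuda" else torch.float32

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 64, 32, device=device, dtype=dtype) for _ in range(3))

math_only = dict(enable_flash=False, enable_math=True, enable_mem_efficient=False)
all_enabled = dict(enable_flash=True, enable_math=True, enable_mem_efficient=True)

diff_math = run_to_run_diff(q, k, v, **math_only)
diff_all = run_to_run_diff(q, k, v, **all_enabled)
print("math only   :", diff_math)
print("all enabled :", diff_all)
```

A nonzero value in the second line but not the first would suggest that a different kernel is being picked when all backends are enabled.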

I would massively appreciate help with this; it might end a five-month debugging saga!

I don’t fully understand this statement since PyTorch itself ships with the needed kernels.

Thanks for your reply! It appears they are all installed; I hadn’t fully read the error message, e.g. the one raised when running memory-efficient attention with determinism.
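In case it helps others, one way to check which backends can actually run for a given input is to enable them one at a time and read the full error when one fails. This is only a sketch: probe_backends is a hypothetical helper name, the shapes are arbitrary, and the result depends on the PyTorch version, device, and dtype. Newer releases also add further backends that this context manager leaves enabled by default, so an "ok" there may not come from the kernel named.

```python
import torch
import torch.nn.functional as F


def probe_backends(q, k, v):
    """Try each SDPA backend alone and record whether the call succeeds."""
    configs = {
        "flash": dict(enable_flash=True, enable_math=False, enable_mem_efficient=False),
        "math": dict(enable_flash=False, enable_math=True, enable_mem_efficient=False),
        "mem_efficient": dict(enable_flash=False, enable_math=False, enable_mem_efficient=True),
    }
    report = {}
    for name, flags in configs.items():
        try:
            with torch.backends.cuda.sdp_kernel(**flags):
                F.scaled_dot_product_attention(q, k, v)
            report[name] = "ok"
        except RuntimeError as err:
            # Keep the whole message: it usually states why no kernel was usable.
            report[name] = f"failed: {err}"
    return report


device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(2, 4, 64, 32, device=device, dtype=dtype) for _ in range(3))

report = probe_backends(q, k, v)
for name, status in report.items():
    print(f"{name:>14}: {status}")
```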