In https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html, I see there are three scaled dot product attention algorithms.
I am facing an interesting scenario where the kernels for flash and memory-efficient attention aren’t installed, but I get different results when I use
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    ...
and
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    ...
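For concreteness, here is roughly what the comparison looks like end to end. This is only a minimal sketch: the shapes, dtype, device, and seed are placeholders, not my actual workload.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Placeholder inputs; my real q/k/v come out of a model.
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Math backend only.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v)

# All backends enabled; PyTorch picks whichever one it considers available.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    out_all = F.scaled_dot_product_attention(q, k, v)

# Sanity check on whether both configurations dispatch to the same kernel.
print((out_math - out_all).abs().max())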
Specifically, with all sources of randomness accounted for, the difference in results between running with and without torch.use_deterministic_algorithms(True) is zero in the first case (flash and memory-efficient disabled) but nonzero in the second (all attention algorithms enabled). Yet as far as I know, there is only one possible implementation on my system, so both context managers should end up running the same kernel. Is there any explanation for this?
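In case the exact measurement matters, this is roughly how I compute that difference. Again a minimal sketch: the forward pass, shapes, and seed stand in for my real setup.

import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")  # required for deterministic cuBLAS matmuls

import torch
import torch.nn.functional as F

def run(deterministic, enable_all):
    torch.use_deterministic_algorithms(deterministic)
    torch.manual_seed(0)  # every source of randomness pinned
    q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    with torch.backends.cuda.sdp_kernel(enable_flash=enable_all, enable_math=True, enable_mem_efficient=enable_all):
        return F.scaled_dot_product_attention(q, k, v)

# Difference between deterministic and non-deterministic runs, per configuration.
print((run(True, enable_all=False) - run(False, enable_all=False)).abs().max())  # math only: zero for me
print((run(True, enable_all=True) - run(False, enable_all=True)).abs().max())    # all enabled: nonzero for me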
Would massively appreciate help with this; it might end a five-month debugging saga!