In https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html, I see there are three scaled dot product attention algorithms.
I am facing an interesting scenario where the kernels for flash and memory-efficient attention aren’t installed, but I get different results when I use
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    ...
and
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    ...
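For concreteness, here is roughly what the comparison looks like end to end. This is only a minimal sketch: the shapes, dtype, device, and seed are placeholders, not my actual workload.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Placeholder inputs; my real q/k/v come out of a model.
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Math backend only.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v)

# All backends enabled; PyTorch picks whichever one it considers available.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    out_all = F.scaled_dot_product_attention(q, k, v)

# Sanity check on whether both configurations dispatch to the same kernel.
print((out_math - out_all).abs().max())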
Specifically, with all sources of randomness accounted for, the difference in results between running with and without torch.use_deterministic_algorithms(True) is zero in the first case (flash and memory-efficient disabled) but nonzero in the second (all attention algorithms enabled). Yet as far as I know, there is only one possible implementation on my system, so both context managers should end up running the same kernel. Is there any explanation for this?
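In case the exact measurement matters, this is roughly how I compute that difference. Again a minimal sketch: the forward pass, shapes, and seed stand in for my real setup.

import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")  # required for deterministic cuBLAS matmuls

import torch
import torch.nn.functional as F

def run(deterministic, enable_all):
    torch.use_deterministic_algorithms(deterministic)
    torch.manual_seed(0)  # every source of randomness pinned
    q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    with torch.backends.cuda.sdp_kernel(enable_flash=enable_all, enable_math=True, enable_mem_efficient=enable_all):
        return F.scaled_dot_product_attention(q, k, v)

# Difference between deterministic and non-deterministic runs, per configuration.
print((run(True, enable_all=False) - run(False, enable_all=False)).abs().max())  # math only: zero for me
print((run(True, enable_all=True) - run(False, enable_all=True)).abs().max())    # all enabled: nonzero for me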
Would massively appreciate help with this; it might end a five-month debugging saga!