In some circumstances when given tensors on a CUDA device and using CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True. See Reproducibility for more information.
We saw that the forward pass is mentioned in the doc. But is scaled_dot_product_attention’s backward pass reproducible when flash_attention is chosen as the backend?
It seems that atomicAdd is used in the backward pass of FlashAttention-2, which would not guarantee determinism. Is this the case?
But when I repeated the backward pass many times, the output appeared to be deterministic.
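For context on why atomicAdd raises a determinism concern at all: floating-point addition is not associative, so if atomic updates from different CUDA threads land in a different order between runs, the accumulated gradient can change bit-for-bit. A minimal pure-Python sketch of the underlying effect (this is a generic illustration, not FlashAttention code):

```python
# Floating-point addition is not associative: the same three numbers
# summed in two different orders give different results. This is why
# run-to-run variation in atomicAdd ordering can change gradients.
a, b, c = 1e16, 1.0, -1e16

s1 = (a + b) + c   # the 1.0 is absorbed: 1e16 + 1.0 rounds back to 1e16
s2 = (a + c) + b   # the large terms cancel first, so the 1.0 survives

print(s1)  # 0.0
print(s2)  # 1.0
```

Note that observing identical outputs across repeated runs does not prove determinism: for a given problem size and hardware, the atomic update ordering may happen to be stable, or the partial sums may round identically, while a different shape or GPU could still produce run-to-run differences.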