In some circumstances when given tensors on a CUDA device and using CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True. See Reproducibility for more information.
We saw that the forward pass is mentioned in the doc. But is scaled_dot_product_attention’s backward pass reproducible when flash_attention is chosen as the backend?
It seems that atomicAdd is used in the backward pass of FlashAttention-2, which would not guarantee determinism. Is this the case?
But when I repeated the backward pass many times, the output appeared to be deterministic.
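For context on why atomicAdd raises a determinism concern at all: floating-point addition is not associative, so if atomic updates from different CUDA threads land in a different order between runs, the accumulated gradient can change bit-for-bit. A minimal pure-Python sketch of the underlying effect (this is a generic illustration, not FlashAttention code):

```python
# Floating-point addition is not associative: the same three numbers
# summed in two different orders give different results. This is why
# run-to-run variation in atomicAdd ordering can change gradients.
a, b, c = 1e16, 1.0, -1e16

s1 = (a + b) + c   # the 1.0 is absorbed: 1e16 + 1.0 rounds back to 1e16
s2 = (a + c) + b   # the large terms cancel first, so the 1.0 survives

print(s1)  # 0.0
print(s2)  # 1.0
```

Note that observing identical outputs across repeated runs does not prove determinism: for a given problem size and hardware, the atomic update ordering may happen to be stable, or the partial sums may round identically, while a different shape or GPU could still produce run-to-run differences.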