Backward pass of scaled_dot_product_attention fails on H100

I ran into the same problem with loss.backward(). After I switched PyTorch to the Preview (Nightly) build, it worked.
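
For anyone hitting this, a sketch of how to switch to the nightly build with pip; the CUDA tag in the index URL (`cu121` here) is an assumption, so substitute the one that matches your environment from the selector on pytorch.org:

```shell
# Install the PyTorch Preview (Nightly) build.
# The CUDA version in the index URL (cu121) is an example -- pick the tag
# matching your CUDA toolkit from https://pytorch.org/get-started/locally/
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
```

Uninstalling the stable `torch` package first (or using a fresh virtual environment) avoids mixing the two builds.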