Modify scaled_dot_product_attention signature

Hi,
I want to implement multi-head self-attention with a relative timestamp embedding, as described here.

I'm not sure if I'm misunderstanding the documentation for `scaled_dot_product_attention` here, but it seems the `attn_mask` argument can also be used to inject an additive attention bias. It might be cleaner if we instead had a separate parameter for the bias, or, optionally, returned the attention weights in addition to `attn_weights @ value` to allow more flexible use of the function.
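To illustrate the workaround being discussed: when `attn_mask` is a float tensor, it is added to the pre-softmax scores, so a relative-position (or relative-timestamp) bias can be passed through it today. A minimal sketch, with arbitrary shapes and a random bias standing in for a learned embedding:

```python
import math
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch=2, heads=4, seq_len=8, head_dim=16
B, H, L, D = 2, 4, 8, 16
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Per-head relative bias; random here, but in practice a learned
# embedding indexed by relative position/timestamp.
rel_bias = torch.randn(H, L, L)

# A float attn_mask is added to Q @ K^T / sqrt(D) before softmax,
# so it doubles as an additive attention bias. unsqueeze(0) makes
# the (H, L, L) bias broadcast over the batch dimension.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=rel_bias.unsqueeze(0))

# Reference: explicit softmax(Q @ K^T / sqrt(D) + bias) @ V
scores = q @ k.transpose(-2, -1) / math.sqrt(D) + rel_bias
ref = torch.softmax(scores, dim=-1) @ v
```

Note that `out` matches `ref`, but the intermediate `torch.softmax(scores, dim=-1)` weights are only available in the explicit version, which is the flexibility gap the proposed signature change would address.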

What do folks think about this? Happy to contribute this change if it makes sense.