That’s a good observation.
PyTorch recommends setting need_weights=False (the default is True) for improved performance [source]. When need_weights=False is set and key_padding_mask is None, that subcase is executed.
In this case, as noted in the comments, PyTorch uses a more efficient implementation of scaled dot product attention (SDPA) by fusing multiple operations.
This efficient implementation requires attention_mask=None (masking is still handled internally). As a result, even if you pass attention_mask manually, it gets overridden.
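To make this concrete, here is a minimal sketch (the dimensions and hyperparameters are arbitrary, just for illustration) showing how need_weights controls whether nn.MultiheadAttention can take the fast path and whether it returns the weight matrix:

import torch
import torch.nn as nn

# Illustrative shapes only: embed_dim=64, 4 heads, batch of 2, sequence length 10.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 64)  # (batch, seq_len, embed_dim)

# need_weights=False and no masks: eligible for the fused SDPA fast path.
out_fast, weights = mha(x, x, x, need_weights=False)
print(weights)  # None -- the fused path does not materialize the weights

# need_weights=True (the default): uses the slower implementation so that
# the (batch, tgt_len, src_len) attention-weight matrix can be returned.
out_slow, weights = mha(x, x, x, need_weights=True)
print(weights.shape)  # torch.Size([2, 10, 10])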
Why not always use this fused version, then?
Most likely because the fused kernel never materializes the attention weights, so they cannot be returned. For more details, see FlashAttention.
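You can see this directly with the underlying fused primitive. This is just a sketch of the functional API the fast path relies on; it returns only the attention output, never the softmax(QK^T / sqrt(d)) matrix:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)

# A single fused call; there is no second return value for the weights.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 10, 16])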
Hope it helps!