That’s a good observation.
PyTorch recommends setting need_weights=False (the default is True) for improved performance [source]. When need_weights=False is set and key_padding_mask is None, that subcase is executed.
In this case, as noted in the comments, PyTorch uses a more efficient implementation of scaled dot product attention (SDPA) by fusing multiple operations.
This efficient implementation requires attention_mask=None (masking is still handled internally). As a result, even if you pass attention_mask manually, it gets overridden.
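To make this concrete, here is a minimal sketch (the dimensions and hyperparameters are arbitrary, just for illustration) showing how need_weights controls whether nn.MultiheadAttention can take the fast path and whether it returns the weight matrix:

import torch
import torch.nn as nn

# Illustrative shapes only: embed_dim=64, 4 heads, batch of 2, sequence length 10.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 64)  # (batch, seq_len, embed_dim)

# need_weights=False and no masks: eligible for the fused SDPA fast path.
out_fast, weights = mha(x, x, x, need_weights=False)
print(weights)  # None -- the fused path does not materialize the weights

# need_weights=True (the default): uses the slower implementation so that
# the (batch, tgt_len, src_len) attention-weight matrix can be returned.
out_slow, weights = mha(x, x, x, need_weights=True)
print(weights.shape)  # torch.Size([2, 10, 10])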
Why not always use this fused version, then?
Most likely because the fused kernel never materializes the attention weights, so they cannot be returned. For more details, see FlashAttention.
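You can see this directly with the underlying fused primitive. This is just a sketch of the functional API the fast path relies on; it returns only the attention output, never the softmax(QK^T / sqrt(d)) matrix:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)

# A single fused call; there is no second return value for the weights.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 10, 16])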
Hope it helps!