From (Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA) — PyTorch Tutorials 2.1.0+cu121 documentation it mentions that “This function (scaled_dot_product_attention) has already been incorporated into
torch.nn.TransformerEncoderLayer”. However, in the doc of TransformerEncoder, it lists several constraints that have to be satisfied if we want to use flash-atten (It seems that TransformerEncoder can only use flash-atten during the inference time):
But scaled_dot_product_attention does not mention such constraints in its doc. Is this inconsistency correct, or I misunderstand something? Thanks!