Is scaled_dot_product_attention included in TransformerEncoder

yuzhenmao · November 2, 2023, 8:45pm

Hi,

From (Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA) — PyTorch Tutorials 2.2.0+cu121 documentation it mentions that “This function (scaled_dot_product_attention) has already been incorporated into torch.nn.MultiheadAttention and torch.nn.TransformerEncoderLayer”. However, in the doc of TransformerEncoder, it lists several constraints that have to be satisfied if we want to use flash-atten (It seems that TransformerEncoder can only use flash-atten during the inference time):

But scaled_dot_product_attention does not mention such constraints in its doc. Is this inconsistency correct, or I misunderstand something? Thanks!