-            x = self.norm2(x + self._ff_block(x))
-
-        return x
-
-    # self-attention block
-    def _sa_block(self, x: Tensor,
-                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
-        x = self.self_attn(x, x, x,
-                           attn_mask=attn_mask,
-                           key_padding_mask=key_padding_mask,
-                           need_weights=False)[0]
-        return self.dropout1(x)
-
-    # feed forward block
-    def _ff_block(self, x: Tensor) -> Tensor:
-        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
-        return self.dropout2(x)
-
-
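
For reference, a minimal usage sketch of nn.TransformerEncoderLayer, whose _sa_block and _ff_block helpers appear above. This is illustrative code only, not part of the diff, and the shape comments assume batch_first=True:

import torch
import torch.nn as nn

# Illustrative only, not part of this change: run one encoder layer, which
# applies the self-attention block and then the feed-forward block shown above.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   dropout=0.1, batch_first=True)
src = torch.randn(2, 10, 512)   # (batch, seq, d_model) because batch_first=True
out = layer(src)                # same shape as the input
print(out.shape)                # torch.Size([2, 10, 512])
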
-class TransformerDecoderLayer(Module):
-    r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.