Confused about the way self-attention is calculated in nn.Transformer

  1. I noticed that nn.TransformerEncoderLayer and nn.TransformerDecoderLayer take only the first element with `[0]` when calculating self-attention. Does this have any specific meaning? According to my understanding, shouldn't all of the dimensions be kept?
  2. In addition, assuming that this operation has practical significance: when batch_first is True or False, shouldn't the dimension obtained by `[0]` mean different things (batch and seq respectively)? If so, isn't there an ambiguity?
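
For reference, here is a minimal snippet reproducing the call I am asking about. My assumption (based on reading the `_sa_block` source of nn.TransformerEncoderLayer) is that the `[0]` appears right after the `self_attn(...)` call; nn.MultiheadAttention returns a tuple, so the `[0]` may be tuple indexing rather than tensor slicing, but I would like confirmation:

```python
import torch
import torch.nn as nn

# nn.MultiheadAttention.forward returns a tuple:
#   (attn_output, attn_output_weights)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=False)

# With batch_first=False the expected input shape is (seq_len, batch, embed_dim).
x = torch.randn(5, 2, 16)

out = mha(x, x, x)    # a tuple, not a tensor
attn_output = out[0]  # this is the [0] I am asking about

print(type(out).__name__, attn_output.shape)  # attn_output has the same shape as x
```

If `[0]` is indeed indexing the returned tuple, then it would not depend on batch_first at all, which would resolve question 2 — but I want to make sure I am not misreading the source.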