Understanding mask size in Transformer Example


I am trying to understand the Transformer architecture by following one of the PyTorch examples (Language Modeling with nn.Transformer and TorchText — PyTorch Tutorials 1.11.0+cu102 documentation).

However, I am having trouble understanding the dimensions/shape of the mask that is used to limit the self-attention to sequence elements before the “current” token.

In the example, the mask size is [batch_size, batch_size]. I would have expected something like [sequence_length, sequence_length], so that for each position in the sequence there is a separate mask row that indicates which other tokens the self-attention mechanism can “access”.

Running the code, I see that the additive mask has the shape [BS, BS] with the content

tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])

using a batch size (BS) of 4.
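For reference, here is a minimal sketch of how I understand an additive mask like the one above to be built. It is written in plain Python so it runs without PyTorch; the helper name `causal_mask` and its argument `n` are my own and not taken from the tutorial.

```python
import math

def causal_mask(n):
    """Return an n x n additive mask: 0.0 where column <= row, -inf elsewhere."""
    return [[0.0 if col <= row else -math.inf for col in range(n)]
            for row in range(n)]

# Build the mask for size 4, the value I used when running the example.
mask = causal_mask(4)
for row in mask:
    print(row)
```

Row i of this mask has zeros in columns 0..i and -inf after that, which matches the pattern printed above. What I cannot tell from this alone is whether n is meant to be the batch size or the sequence length.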

Can someone clarify this for me? Are different masks applied to the different sequences in a batch? And, more generally, if I have a BS of 4 and sequences of length 6, for example, is the model not supposed to learn during training the probability of token N+1 based on all N previous tokens? So does each input token position have its own mask vector?
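To make my expectation concrete, here is a small sketch of what I imagine happening, again in plain Python. It assumes a batch size of 4, a sequence length of 6, dummy attention scores of zero, and a single [6, 6] mask shared by every sequence in the batch. The function `masked_softmax` is a hypothetical helper of mine, and this is my assumption about the mechanism, not code from the tutorial.

```python
import math

BATCH_SIZE, SEQ_LEN = 4, 6  # example values from my question

# One [SEQ_LEN, SEQ_LEN] additive mask: row i allows positions 0..i only.
mask = [[0.0 if j <= i else -math.inf for j in range(SEQ_LEN)]
        for i in range(SEQ_LEN)]

def masked_softmax(scores_row, mask_row):
    """Softmax over one row of attention scores after adding the mask row."""
    shifted = [s + m for s, m in zip(scores_row, mask_row)]
    peak = max(shifted)  # finite, because the diagonal is never masked
    exps = [math.exp(x - peak) for x in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Dummy attention scores (all zeros) for every sequence in the batch.
scores = [[[0.0] * SEQ_LEN for _ in range(SEQ_LEN)] for _ in range(BATCH_SIZE)]

# The same mask is reused for every sequence in the batch (my assumption).
weights = [[masked_softmax(row, mask[i]) for i, row in enumerate(seq)]
           for seq in scores]

print(weights[0][2])  # position 2 should put weight only on positions 0, 1, 2
```

In this sketch the mask shape depends only on the sequence length, and each row of the mask acts as the per-position mask vector I was asking about.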

Thank you for the clarification.