Hi, I am a bit confused about how to implement a causal attention mask in a Transformer. Specifically, should “-inf” be placed in the upper or lower triangular part of the attention matrix?

In PyTorch’s recommended way of obtaining the attention mask, the “-inf” values are placed in the upper triangular part:

```
import torch.nn as nn
model = nn.Transformer()
mask = model.generate_square_subsequent_mask(sz=4)
print(mask)
```

Outputs:

```
tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])
```
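To see what this mask does, here is a small sanity check I tried (the mask is built manually with `torch.triu` so it matches the printed tensor above, and the all-zero scores are just a stand-in for real attention logits):

```python
import torch

sz = 4
# -inf above the diagonal, 0 elsewhere -- same as the printed mask above
mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

# Dummy attention scores (rows = queries, columns = keys); the mask is
# added to the scores before softmax, so -inf entries get zero weight
scores = torch.zeros(sz, sz)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i has nonzero weight only on columns j <= i,
# i.e. each position attends only to itself and earlier positions.
```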

However, in PyTorch’s implementation of `torch.nn.MultiheadAttention`, the documentation for `attn_mask` is:

> attn_mask – If specified, a 2D or 3D mask preventing attention to certain positions. Must be of shape (L, S) or (N * num_heads, L, S), where N is the batch size, L is the target sequence length, and S is the source sequence length.

Here I am assuming that “source sequence length” is the length of the input tokens, and “target sequence length” is the length of the output tokens. Since we do not want the model to attend to future outputs, shouldn’t the “-inf” be placed in the lower triangular part of the attention matrix? I am getting very confused by this.
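For what it’s worth, here is a small experiment I ran to check which convention `nn.MultiheadAttention` actually uses. In the (L, S) mask, row i indexes the query (target) position and column j indexes the key (source) position, so `mask[i, j] = -inf` means position i cannot attend to position j. With an upper-triangular mask, the returned attention weights come out lower-triangular, i.e. no attention to the future. (The embedding size and the random input here are arbitrary, just for illustration.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L = 4
# -inf in the UPPER triangle: query i is blocked from keys j > i
mask = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

mha = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
x = torch.randn(1, L, 8)  # (batch, seq, embed)
_, attn_weights = mha(x, x, x, attn_mask=mask)

print(attn_weights[0])
# The weights are lower-triangular: every entry above the diagonal is 0,
# so each position only attends to itself and earlier positions.
```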