Hi, I am a bit confused about how to implement a causal attention mask in a Transformer. Specifically, should “-inf” be placed in the upper or lower triangular part of the attention matrix?

In PyTorch’s recommended way of obtaining the attention mask, the “-inf” values are placed in the upper triangular part:

```
import torch.nn as nn
model = nn.Transformer()
mask = model.generate_square_subsequent_mask(sz=4)
print(mask)
```

Outputs:

```
tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])
```
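To see what this mask does, here is a small sanity check I tried (the mask is built manually with `torch.triu` so it matches the printed tensor above, and the all-zero scores are just a stand-in for real attention logits):

```python
import torch

sz = 4
# -inf above the diagonal, 0 elsewhere -- same as the printed mask above
mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

# Dummy attention scores (rows = queries, columns = keys); the mask is
# added to the scores before softmax, so -inf entries get zero weight
scores = torch.zeros(sz, sz)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i has nonzero weight only on columns j <= i,
# i.e. each position attends only to itself and earlier positions.
```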

However, in PyTorch’s implementation of `torch.nn.MultiheadAttention`, the documentation for `attn_mask` is:

> attn_mask – If specified, a 2D or 3D mask preventing attention to certain positions. Must be of shape (L, S) or (N * num_heads, L, S), where N is the batch size, L is the target sequence length, and S is the source sequence length.

Here I am assuming that “source sequence length” is the length of the input tokens, and “target sequence length” is the length of the output tokens. Since we do not want the model to attend to future outputs, shouldn’t the “-inf” be placed in the lower triangular part of the attention matrix? I am getting very confused by this.
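For what it’s worth, here is a small experiment I ran to check which convention `nn.MultiheadAttention` actually uses. In the (L, S) mask, row i indexes the query (target) position and column j indexes the key (source) position, so `mask[i, j] = -inf` means position i cannot attend to position j. With an upper-triangular mask, the returned attention weights come out lower-triangular, i.e. no attention to the future. (The embedding size and the random input here are arbitrary, just for illustration.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L = 4
# -inf in the UPPER triangle: query i is blocked from keys j > i
mask = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

mha = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
x = torch.randn(1, L, 8)  # (batch, seq, embed)
_, attn_weights = mha(x, x, x, attn_mask=mask)

print(attn_weights[0])
# The weights are lower-triangular: every entry above the diagonal is 0,
# so each position only attends to itself and earlier positions.
```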