I am trying to map my understanding of the masks used in TransformerDecoderLayer to the attention_mask convention used by Hugging Face. Suppose I have the following model (and data).
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, batch_first=True)
memory = torch.rand(4, 3, 64)  # encoder output: (batch, source_len, d_model)
tgt = torch.rand(4, 5, 64)     # target embeddings: (batch, target_len, d_model)
out = decoder_layer(tgt, memory)
In the context of a translation task, tgt holds the embeddings of the translated (output) sentence, and memory is the output of the encoder model. Since the input and output sentences can have different lengths, suppose I have the following attention_masks.
tgt_attention_mask = torch.LongTensor(
    [
        [1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0],
    ]
)
memory_attention_mask = torch.LongTensor(
    [
        [1, 1, 1],
        [1, 1, 1],
        [1, 0, 0],
        [1, 0, 0],
    ]
)
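To make the padding explicit, the number of real tokens in each sentence can be read off each mask by summing its rows (a quick self-contained check):

```python
import torch

tgt_attention_mask = torch.LongTensor(
    [[1, 1, 1, 1, 1],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0]]
)
memory_attention_mask = torch.LongTensor(
    [[1, 1, 1],
     [1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
)

# Summing each row counts the 1s, i.e. the non-padding tokens.
print(tgt_attention_mask.sum(dim=1))     # tensor([5, 3, 1, 2])
print(memory_attention_mask.sum(dim=1))  # tensor([3, 3, 1, 1])
```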
Importantly, notice that the tgt sequence length is 5, while the memory sequence length is 3.
From my understanding of decoder transformers, there are two types of attention matrices: (1) a self-attention matrix of size (batch_size, 5, 5), and (2) a cross-attention matrix of size (batch_size, 5, 3). The question is how to pass my attention masks into the mask arguments here. TransformerDecoderLayer.forward takes four of them (tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask), and I’m not sure how these came to be or which of my masks goes where.
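For what it’s worth, here is my current guess at the wiring: convert the Hugging Face-style masks (1 = attend) into PyTorch’s boolean key_padding_mask convention (True = ignore), and build a causal tgt_mask with torch.triu. The argument names below are from TransformerDecoderLayer.forward, but whether this is the intended usage is exactly what I’m unsure about:

```python
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, batch_first=True)
memory = torch.rand(4, 3, 64)  # encoder output: (batch, source_len, d_model)
tgt = torch.rand(4, 5, 64)     # target embeddings: (batch, target_len, d_model)

tgt_attention_mask = torch.LongTensor(
    [[1, 1, 1, 1, 1],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0]]
)
memory_attention_mask = torch.LongTensor(
    [[1, 1, 1],
     [1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
)

# PyTorch key padding masks are boolean with True meaning "ignore this
# position", the opposite of Hugging Face's 1 = "attend" convention.
tgt_key_padding_mask = tgt_attention_mask == 0        # (batch, 5)
memory_key_padding_mask = memory_attention_mask == 0  # (batch, 3)

# Causal mask for self-attention: True above the diagonal blocks
# position i from attending to positions j > i.
target_len = tgt.size(1)
tgt_mask = torch.triu(torch.ones(target_len, target_len, dtype=torch.bool), diagonal=1)

out = decoder_layer(
    tgt,
    memory,
    tgt_mask=tgt_mask,                                # (5, 5) causal
    tgt_key_padding_mask=tgt_key_padding_mask,        # (batch, 5) padding
    memory_key_padding_mask=memory_key_padding_mask,  # (batch, 3) padding
    # memory_mask left as None: cross-attention padding is already
    # handled per batch element by memory_key_padding_mask.
)
print(out.shape)  # torch.Size([4, 5, 64])
```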
Thanks in advance.
Aside
In case you need it, the cross-attention mask would look like this:
cross_attention_mask = tgt_attention_mask.unsqueeze(-1) @ memory_attention_mask.unsqueeze(1)
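And a quick self-contained sanity check that this batched outer product does what I expect: entry (b, i, j) is 1 exactly when tgt position i and memory position j are both real (non-padding) tokens:

```python
import torch

tgt_attention_mask = torch.LongTensor(
    [[1, 1, 1, 1, 1],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0]]
)
memory_attention_mask = torch.LongTensor(
    [[1, 1, 1],
     [1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
)

# Batched outer product: (4, 5, 1) @ (4, 1, 3) -> (4, 5, 3), and
# entry (b, i, j) = tgt_attention_mask[b, i] * memory_attention_mask[b, j].
cross_attention_mask = tgt_attention_mask.unsqueeze(-1) @ memory_attention_mask.unsqueeze(1)
print(cross_attention_mask.shape)  # torch.Size([4, 5, 3])
```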