Mask meaning in TransformerDecoderLayer

I am trying to map my understanding of the masks used in TransformerDecoderLayer onto the HuggingFace convention, where an attention_mask is used. Suppose I have the following model (and data).

import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, batch_first=True)
memory = torch.rand(4, 3, 64)  # (batch_size, memory_len, d_model)
tgt = torch.rand(4, 5, 64)     # (batch_size, tgt_len, d_model)
out = decoder_layer(tgt, memory)

In the context of a translation task, tgt holds the embeddings of the translated (output) sentence, and memory is the output of the encoder model. Since the inputs and outputs can have different sentence lengths, suppose I have the following attention_masks (in the HuggingFace convention, where 1 marks a real token and 0 marks padding).

tgt_attention_mask = torch.LongTensor(
    [
        [1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0],
    ]
)
memory_attention_mask = torch.LongTensor(
    [
        [1, 1, 1],
        [1, 1, 1],
        [1, 0, 0],
        [1, 0, 0],
    ]
)

Importantly, notice that the tgt sequences have length 5, while the memory sequences have length 3.

From my understanding of decoder transformers, there are two kinds of attention in each layer: (1) a self-attention matrix of size (batch_size, 5, 5), and (2) a cross-attention matrix of size (batch_size, 5, 3). The question is: how do I pass these attention masks into the mask arguments of the layer? TransformerDecoderLayer.forward takes four mask arguments (tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask), and I'm not sure how they relate to the two masks above.
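
For what it's worth, here is my current guess at the mapping. I'm assuming that the *_key_padding_mask arguments want True at positions that should be ignored (the inverse of the HuggingFace 1/0 convention), and that tgt_mask is the usual causal mask, but I'd like to confirm that this is what the arguments mean:

# Guessing: PyTorch's key padding masks use True for padded positions,
# i.e. the opposite of the HuggingFace attention_mask convention.
tgt_key_padding_mask = tgt_attention_mask == 0        # (4, 5), True at padding
memory_key_padding_mask = memory_attention_mask == 0  # (4, 3), True at padding

# Causal mask over the 5 target positions: True above the diagonal means
# position i is not allowed to attend to positions j > i.
tgt_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)

out = decoder_layer(
    tgt,
    memory,
    tgt_mask=tgt_mask,
    tgt_key_padding_mask=tgt_key_padding_mask,
    memory_key_padding_mask=memory_key_padding_mask,
)

Is that the intended use of these arguments?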

Thanks in advance.

Aside

In case you need it, the cross-attention mask would look like this:

cross_attention_mask = tgt_attention_mask.unsqueeze(-1) @ memory_attention_mask.unsqueeze(1)
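
(If I have this right, it has shape (4, 5, 3), i.e. (batch_size, tgt_len, memory_len), with a 1 wherever both the query position and the key position are real tokens.)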