Memory_mask in nn.Transformer

I’m implementing training code for a transformer model using nn.Transformer.

In the documentation there is an optional memory_mask argument. I read the docs, but I don’t understand the purpose of this argument.

Could you explain what memory_mask is?

Additionally, is there any example code that uses the nn.Transformer module?

It’s an attention mask applied to the second input of the transformer decoder layer. In the encoder-decoder architecture, this second input is the output of the transformer encoder, which is called the “memory”.
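
Here is a minimal sketch of how the masks are passed to nn.Transformer; the shapes and hyperparameters below are illustrative assumptions, not taken from your setup:

```python
import torch
import torch.nn as nn

# Illustrative sizes: source length S, target length T, batch N, embedding dim E.
S, T, N, E = 10, 7, 4, 512
model = nn.Transformer(d_model=E, nhead=8)

src = torch.rand(S, N, E)   # encoder input:  (S, N, E)
tgt = torch.rand(T, N, E)   # decoder input:  (T, N, E)

# Causal mask for the decoder self-attention, shape (T, T).
tgt_mask = model.generate_square_subsequent_mask(T)

# Optional mask for the decoder's cross-attention over the encoder output
# ("memory"), shape (T, S); an all-zero additive mask leaves everything visible.
memory_mask = torch.zeros(T, S)

out = model(src, tgt, tgt_mask=tgt_mask, memory_mask=memory_mask)
print(out.shape)            # torch.Size([7, 4, 512]) -> (T, N, E)
```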

Then why is the shape of memory_mask (T, S)?

Because memory_mask is applied in the decoder’s multihead_attn (cross-attention) layer: the queries come from the target sequence of length T, while the keys and values come from the encoder memory of length S, so the attention scores being masked form a (T, S) matrix.
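
To make the shape concrete, here is a small sketch that calls nn.MultiheadAttention directly, the same way the decoder’s cross-attention layer does (sizes are again assumptions for illustration):

```python
import torch
import torch.nn as nn

S, T, N, E = 10, 7, 4, 512
mha = nn.MultiheadAttention(embed_dim=E, num_heads=8)

tgt = torch.rand(T, N, E)        # queries come from the decoder (target) side
memory = torch.rand(S, N, E)     # keys/values come from the encoder output

memory_mask = torch.zeros(T, S)  # additive float mask over the (T, S) score matrix
out, attn_weights = mha(tgt, memory, memory, attn_mask=memory_mask)

# The (head-averaged) attention weights are (N, T, S): one row of source
# positions for each target position, which is exactly what memory_mask masks.
print(attn_weights.shape)        # torch.Size([4, 7, 10])
```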