It’s an attention mask applied to the second input of the transformer decoder layer. In the encoder-decoder architecture, that second input is the output of the transformer encoder, which we call the “memory”, so the mask acts on the decoder’s cross-attention over the encoder output.
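For reference, here is roughly where the mask plugs in; the d_model, nhead, batch size, and sequence lengths below are placeholder values I picked purely for illustration:

```python
import torch
import torch.nn as nn

T, S = 5, 7                      # target length, source (memory) length
decoder_layer = nn.TransformerDecoderLayer(d_model=16, nhead=4)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

tgt = torch.rand(T, 1, 16)       # (T, batch, d_model) decoder input
memory = torch.rand(S, 1, 16)    # (S, batch, d_model) encoder output
memory_mask = torch.zeros(T, S)  # float mask added to the cross-attention scores

out = decoder(tgt, memory, memory_mask=memory_mask)  # out: (T, 1, 16)
```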
I am wondering how to generate the memory mask. The generate_square_subsequent_mask function can only produce square masks, but memory_mask requires the shape (T, S). Is there a built-in function in the transformer module for this? Thank you!
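What I have in mind is building it by hand, something like the sketch below; the masking rule itself is arbitrary and only meant to show how a (T, S) float mask (0.0 = may attend, -inf = blocked) could be put together without a helper function:

```python
import torch

T, S = 5, 7  # target length, source (memory) length

# Arbitrary example rule: target position t may only attend to memory
# positions s <= t.  Positions where t < s get -inf and are blocked.
memory_mask = torch.zeros(T, S)
memory_mask[torch.arange(T).unsqueeze(1) < torch.arange(S).unsqueeze(0)] = float("-inf")
```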