The `src_mask` is used to prevent the Transformer from cheating, i.e., attending to future positions. In a language-modeling task, the size of `src_mask` (called `attn_mask` in the PyTorch source) should be `seq_len x seq_len` (is this statement true?). However, I found a PyTorch toy example (example link) that generates this mask with size `batch_size x batch_size`. Which one is correct? And if the latter is true, can you explain why we have to generate the mask with size `batch_size x batch_size`?
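
For reference, here is a minimal sketch of what I mean by a `seq_len x seq_len` mask, modeled on PyTorch's `nn.Transformer.generate_square_subsequent_mask` (the standalone function name below is just for illustration):

```python
import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Causal mask: position i may not attend to any position j > i.
    # Entries above the diagonal are set to -inf so the softmax
    # assigns them zero attention weight; allowed entries stay 0.
    mask = torch.triu(torch.ones(sz, sz), diagonal=1)
    return mask.masked_fill(mask == 1, float('-inf'))

seq_len = 5
src_mask = generate_square_subsequent_mask(seq_len)
print(src_mask.shape)  # torch.Size([5, 5]) -- i.e. seq_len x seq_len
```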