`src_mask` is used to prevent the Transformer from cheating, i.e., attending to future positions. In a language-modeling task, I would expect the size of `src_mask` (called `attn_mask` in the PyTorch source code) to be `seq_len x seq_len` — is that correct? However, I found a PyTorch toy example (example link) that generates this mask with size `batch_size x batch_size`. Which one is correct? And if the latter is true, can you explain why the mask has to be `batch_size x batch_size`?
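
For reference, this is roughly the mask-generation helper I am talking about (a sketch reconstructed from memory of the tutorial code, so the exact function name may differ; the point is only what `sz` should be — `seq_len` or `batch_size`):

```python
import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Lower-triangular matrix of ones: position i may attend to positions <= i.
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    # Allowed positions become 0.0, disallowed positions become -inf,
    # so they contribute nothing after the softmax inside attention.
    return mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)

seq_len = 5
print(generate_square_subsequent_mask(seq_len))
# Expected: a 5 x 5 matrix with 0.0 on and below the diagonal, -inf above it.
```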