The `src_mask` is used to prevent the Transformer from cheating, i.e., attending to future positions. In a language-modeling task, the size of `src_mask` (called `attn_mask` in the PyTorch source) should be `seq_len x seq_len` (is this statement true?). However, I found a PyTorch toy example (example link) that generates this mask with size `batch_size x batch_size`. Which one is correct? And if the latter is true, can you explain why we have to generate the mask with size `batch_size x batch_size`?
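
For reference, here is a minimal sketch of what I mean by a `seq_len x seq_len` mask, modeled on PyTorch's `nn.Transformer.generate_square_subsequent_mask` (the standalone function name below is just for illustration):

```python
import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Causal mask: position i may not attend to any position j > i.
    # Entries above the diagonal are set to -inf so the softmax
    # assigns them zero attention weight; allowed entries stay 0.
    mask = torch.triu(torch.ones(sz, sz), diagonal=1)
    return mask.masked_fill(mask == 1, float('-inf'))

seq_len = 5
src_mask = generate_square_subsequent_mask(seq_len)
print(src_mask.shape)  # torch.Size([5, 5]) -- i.e. seq_len x seq_len
```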