Can you please tell me what the difference is between the two sets of masks, viz. ***_mask and ***_key_padding_mask?
From the documentation in the source code, this is what I could deduce. I am not very confident about it, though, so I would really appreciate it if you could correct me:
- src_mask, tgt_mask and memory_mask should be used when we want to apply the same mask to all the sequences in the given batch.
- src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask should be used when we want to specify different masks for different samples in the given batch. Also, the way these masks are specified is slightly different from the first set.
My question is: Do both sets of masks serve the same purpose, and is it enough to use just one of the two sets?
For instance, if I want to create a Seq2Seq Transformer model with both TransformerEncoder and TransformerDecoder, is it OK if I only specify src_mask, tgt_mask and memory_mask?
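
To make the question concrete, here is a minimal sketch of how I understand the two kinds of masks would be passed together. It assumes PyTorch's nn.Transformer with batch_first=True and boolean masks; the sizes, the random inputs and the padded positions are made up purely for illustration, so please treat it as my reading of the docs rather than a confirmed recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only: batch, source length, target length, embedding size.
N, S, T, E = 2, 5, 4, 16

model = nn.Transformer(
    d_model=E, nhead=4,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0, batch_first=True,
)

src = torch.randn(N, S, E)  # stand-in for embedded source tokens
tgt = torch.randn(N, T, E)  # stand-in for embedded target tokens

# First set: one mask shared by every sequence in the batch, shape (T, T).
# True means "this position may not be attended to" (here: future positions).
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Second set: one row per sample, shape (N, S). True marks a padding token.
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[1, 3:] = True  # assumption: sample 1 ends with 2 pad tokens

out = model(
    src, tgt,
    tgt_mask=tgt_mask,
    src_key_padding_mask=src_key_padding_mask,
    # The decoder attends over the encoder output, so the same padding
    # pattern is reused for the memory.
    memory_key_padding_mask=src_key_padding_mask,
)
print(out.shape)
```

In this sketch tgt_mask has no batch dimension, while src_key_padding_mask has one row per sample, which is the difference I am trying to confirm.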