Regarding tgt_mask: (T, T) set as optional in nn.TransformerDecoder

Shouldn’t tgt_mask always be required, since we do not want nn.TransformerDecoder to attend to future tokens during training?
For example, suppose the input to nn.TransformerDecoder is the sequence

<sos> hello world.

In nn.TransformerDecoderLayer, self-attention is computed as

tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
                      key_padding_mask=tgt_key_padding_mask)[0]

Then, if we do not provide tgt_mask and just pass (tgt, tgt, tgt) to self_attn, the representation of ‘hello’ will also attend to ‘world’ when computing self-attention.
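To make the masking concrete, here is a minimal pure-Python sketch of the causal (subsequent) mask that nn.Transformer.generate_square_subsequent_mask produces in PyTorch: 0.0 where position i is allowed to attend to position j (j <= i), and -inf where attention to a future token must be blocked. The function name mirrors the PyTorch helper, but this is an illustrative reimplementation, not the library code itself.

```python
import math

def generate_square_subsequent_mask(sz):
    # Sketch of the causal mask used as attn_mask in decoder self-attention:
    # entry [i][j] is 0.0 if token i may attend to token j (j <= i),
    # and -inf if token j is in the future (j > i).
    # Added to the attention scores before softmax, -inf zeroes out
    # the corresponding attention weight.
    return [[0.0 if j <= i else -math.inf for j in range(sz)]
            for i in range(sz)]

# For the 3-token sequence "<sos> hello world":
mask = generate_square_subsequent_mask(3)
# Row 0 ("<sos>"): can attend only to itself.
# Row 1 ("hello"): can attend to "<sos>" and "hello"; "world" is -inf.
# Row 2 ("world"): can attend to all three tokens.
for row in mask:
    print(row)
```

With this mask passed as attn_mask, the score of ‘hello’ attending to ‘world’ becomes -inf before the softmax, so its attention weight is exactly zero; without the mask, that score is left untouched and ‘hello’ leaks information from the future token.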