Regarding tgt_mask: (T, T) set as optional in nn.TransformerDecoder

Shouldn’t tgt_mask always be required, since we do not want nn.TransformerDecoder to attend to future tokens during training?
For example, suppose the input to nn.TransformerDecoder is the sequence

<sos> hello world.

In nn.TransformerDecoderLayer, self-attention is computed as

tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
                      key_padding_mask=tgt_key_padding_mask)[0]

Then, if we do not provide tgt_mask and just pass (tgt, tgt, tgt) to self_attn, the representation of ‘hello’ will also attend to ‘world’ when computing self-attention.
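To make the masking concrete, here is a minimal pure-Python sketch of the causal (subsequent) mask that nn.Transformer.generate_square_subsequent_mask produces in PyTorch: 0.0 where position i is allowed to attend to position j (j <= i), and -inf where attention to a future token must be blocked. The function name mirrors the PyTorch helper, but this is an illustrative reimplementation, not the library code itself.

```python
import math

def generate_square_subsequent_mask(sz):
    # Sketch of the causal mask used as attn_mask in decoder self-attention:
    # entry [i][j] is 0.0 if token i may attend to token j (j <= i),
    # and -inf if token j is in the future (j > i).
    # Added to the attention scores before softmax, -inf zeroes out
    # the corresponding attention weight.
    return [[0.0 if j <= i else -math.inf for j in range(sz)]
            for i in range(sz)]

# For the 3-token sequence "<sos> hello world":
mask = generate_square_subsequent_mask(3)
# Row 0 ("<sos>"): can attend only to itself.
# Row 1 ("hello"): can attend to "<sos>" and "hello"; "world" is -inf.
# Row 2 ("world"): can attend to all three tokens.
for row in mask:
    print(row)
```

With this mask passed as attn_mask, the score of ‘hello’ attending to ‘world’ becomes -inf before the softmax, so its attention weight is exactly zero; without the mask, that score is left untouched and ‘hello’ leaks information from the future token.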