Attention Is All You Need (Transformer)

arjung · August 8, 2019, 6:21pm

I think they are using “teacher forcing” (feeding the real target outputs in as each next input during training, instead of the decoder’s own guess), which is why they pass the target output in as input to the decoder.
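
To make that concrete, here is a rough sketch of how I picture it (not code from the paper). The `<bos>`/`<eos>` marker strings and the function names `teacher_forcing_pairs` and `free_running_inputs` are my own assumptions, purely for illustration. During training the decoder input is just the target shifted right by one position; at inference there is no target, so the decoder has to consume its own guesses.

```python
# Hypothetical marker tokens, chosen for illustration only.
BOS, EOS = "<bos>", "<eos>"

def teacher_forcing_pairs(target_tokens):
    """Training: build (decoder_input, expected_output) from the real target.

    The decoder input is the target shifted right by one position, so at
    step t the decoder sees the true tokens before t and must predict the
    true token at t.
    """
    decoder_input = [BOS] + list(target_tokens)
    expected_output = list(target_tokens) + [EOS]
    return decoder_input, expected_output

def free_running_inputs(predict_next, max_len):
    """Inference: no target is available, so feed back the decoder's guesses.

    predict_next is a stand-in for a trained model: it takes the tokens
    generated so far and returns the next token.
    """
    tokens = [BOS]
    for _ in range(max_len):
        next_token = predict_next(tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

decoder_input, expected_output = teacher_forcing_pairs(["ich", "bin", "hier"])
print(decoder_input)    # ['<bos>', 'ich', 'bin', 'hier']
print(expected_output)  # ['ich', 'bin', 'hier', '<eos>']
```

Since the whole shifted target is passed in at once, the decoder's self-attention has to be masked so that position t cannot look at later positions; otherwise it could just copy the answer.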