I was looking at the paper titled “Attention Is All You Need” (https://arxiv.org/pdf/1706.03762.pdf) which introduces the Transformer, an encoder-decoder sequence model based solely on attention mechanisms.
Here’s a snippet of the code from a PyTorch implementation (https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/train.py):
# prepare data: move the batch tensors onto the target device
src_seq, src_pos, tgt_seq, tgt_pos = map(lambda x: x.to(device), batch)
gold = tgt_seq[:, 1:]  # target = tgt_seq without its first token (presumably the start-of-sequence token)
# forward
optimizer.zero_grad()
pred = model(src_seq, src_pos, tgt_seq, tgt_pos)
# backward
loss, n_correct = cal_performance(pred, gold, smoothing=smoothing)
loss.backward()
According to this, tgt_seq is one of the inputs to the decoder, while gold is the target used to compute the loss. How is it fair that gold, the variable the loss is computed against, is simply a slice of tgt_seq, which is itself an input to the model? Doesn't the decoder then get to see the very tokens it is being trained to predict?
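To make the slicing I'm asking about concrete, here is a toy example of the relationship between tgt_seq and gold (the token IDs and the choice of 2/3 as the start/end-of-sequence IDs are my own made-up illustration, not from the repo):

```python
import torch

# Toy batch with one target sequence: <bos> w1 w2 w3 <eos>
# (token IDs are invented for illustration: 2 = <bos>, 3 = <eos>)
tgt_seq = torch.tensor([[2, 11, 12, 13, 3]])

# gold drops the first token, so it is tgt_seq shifted left by one position
gold = tgt_seq[:, 1:]

print(tgt_seq.tolist())  # [[2, 11, 12, 13, 3]]
print(gold.tolist())     # [[11, 12, 13, 3]]
```

So at every position, the token the model must predict is already present one step later in its own input, which is exactly what puzzles me.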
Any ideas would be much appreciated; thanks in advance!