I was looking at the paper titled “Attention Is All You Need” (https://arxiv.org/pdf/1706.03762.pdf) which introduces the Transformer, an encoder-decoder sequence model based solely on attention mechanisms.
Here’s a snippet of the code from a PyTorch implementation (https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/train.py):
# prepare data: move the batch tensors onto the target device
src_seq, src_pos, tgt_seq, tgt_pos = map(lambda x: x.to(device), batch)
gold = tgt_seq[:, 1:]  # target = tgt_seq without its first token (presumably the start-of-sequence token)
# forward
optimizer.zero_grad()
pred = model(src_seq, src_pos, tgt_seq, tgt_pos)
# backward
loss, n_correct = cal_performance(pred, gold, smoothing=smoothing)
loss.backward()
According to this, tgt_seq is one of the inputs to the decoder, while gold is the target used to compute the loss. How is it fair that gold, the variable the loss is computed against, is simply a slice of tgt_seq, which is itself an input to the model? Doesn't the decoder then get to see the very tokens it is being trained to predict?
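To make the slicing I'm asking about concrete, here is a toy example of the relationship between tgt_seq and gold (the token IDs and the choice of 2/3 as the start/end-of-sequence IDs are my own made-up illustration, not from the repo):

```python
import torch

# Toy batch with one target sequence: <bos> w1 w2 w3 <eos>
# (token IDs are invented for illustration: 2 = <bos>, 3 = <eos>)
tgt_seq = torch.tensor([[2, 11, 12, 13, 3]])

# gold drops the first token, so it is tgt_seq shifted left by one position
gold = tgt_seq[:, 1:]

print(tgt_seq.tolist())  # [[2, 11, 12, 13, 3]]
print(gold.tolist())     # [[11, 12, 13, 3]]
```

So at every position, the token the model must predict is already present one step later in its own input, which is exactly what puzzles me.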
Any ideas would be much appreciated; thanks in advance!