I was looking at the paper “Attention Is All You Need” (https://arxiv.org/pdf/1706.03762.pdf), which introduces the Transformer, an encoder-decoder sequence model based solely on attention mechanisms.
Here’s a snippet of the code from a PyTorch implementation (https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/train.py):
```python
# prepare data
src_seq, src_pos, tgt_seq, tgt_pos = map(lambda x: x.to(device), batch)
gold = tgt_seq[:, 1:]

# forward
optimizer.zero_grad()
pred = model(src_seq, src_pos, tgt_seq, tgt_pos)

# backward
loss, n_correct = cal_performance(pred, gold, smoothing=smoothing)
loss.backward()
```
According to this, tgt_seq is one of the inputs to the decoder, while gold is the target used to compute the loss. But gold is simply a slice of tgt_seq (gold = tgt_seq[:, 1:]). How is it fair that the loss target is just a slice of something the model receives as an input? Doesn't that mean the decoder is being shown the very sequence it is supposed to predict?
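To make the overlap I'm describing concrete, here is a minimal sketch of the slicing (the token IDs and the <bos>/<eos> convention are made up for illustration, not taken from the repo):

```python
import torch

# a toy target batch: <bos> w1 w2 w3 <eos>  (IDs are invented for this example)
tgt_seq = torch.tensor([[2, 10, 11, 12, 3]])

# the decoder is fed the full target sequence, <bos> included
decoder_input = tgt_seq

# the loss target drops the first token, so it is the same sequence
# shifted left by one position
gold = tgt_seq[:, 1:]

print(decoder_input.tolist())  # [[2, 10, 11, 12, 3]]
print(gold.tolist())           # [[10, 11, 12, 3]]
```

So what the model is asked to predict (gold) overlaps almost entirely with what it is given as input (tgt_seq), just offset by one token.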
Any ideas will be much appreciated, thanks in advance!