Hello!
I’m training a standard(ish) encoder-decoder transformer. Ignoring padding for a moment, I’m confused about the general format my data should take for the inputs to the encoder, the inputs to the decoder, and the loss calculation.
At the moment I’ve got it set up so that I have:
src: the input sequence, with an [end_token] appended at the end
tgt: the input sequence, with a [start_token] prepended at the beginning, thus ‘shifting’ it one position to the right (also with an [end_token] at the end)
My model then produces an output of shape [batch, seq_length, vocab_size], and I calculate the loss with:
prediction_loss = criterion(output, tgt)
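In case it helps, here’s a minimal runnable sketch of the setup I’m describing. The token IDs and vocabulary size are made up for illustration, the model is replaced by random logits, and I’m assuming criterion is nn.CrossEntropyLoss (which wants the class dimension second, hence the transpose):

```python
import torch
import torch.nn as nn

# Hypothetical special-token IDs and a toy vocabulary size (made up for illustration).
START_TOKEN, END_TOKEN, VOCAB_SIZE = 1, 2, 10

# A toy "input" sequence of token IDs, batch size 1.
tokens = torch.tensor([[5, 6, 7]])

# src: the input with [end_token] appended at the end.
src = torch.cat([tokens, torch.full((1, 1), END_TOKEN)], dim=1)  # [[5, 6, 7, 2]]

# tgt: the input with [start_token] prepended and [end_token] appended.
tgt = torch.cat(
    [torch.full((1, 1), START_TOKEN), tokens, torch.full((1, 1), END_TOKEN)],
    dim=1,
)  # [[1, 5, 6, 7, 2]]

# Stand-in for the model: random logits of shape [batch, seq_length, vocab_size].
output = torch.randn(1, tgt.size(1), VOCAB_SIZE)

# Loss against tgt; CrossEntropyLoss expects [batch, vocab_size, seq_length] logits,
# so the last two dimensions are transposed.
criterion = nn.CrossEntropyLoss()
prediction_loss = criterion(output.transpose(1, 2), tgt)
```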
However, I’m confused about whether the loss should be calculated against tgt (which includes the [start_token]), against just src, or against something else entirely. In other words, in a ‘perfect’ prediction, should the start/end tokens appear in the output sequence?
As a follow-up question: should the [end_token] be present in both sequences during training?
Many thanks in advance. Sorry if this is obvious; I’ve looked over many tutorials and codebases and am still quite confused on this point.