Hello!
I’m training a standard(ish) encoder-decoder transformer. Ignoring padding for a moment, I’m confused about the general format my data should take for the inputs to the encoder, the inputs to the decoder, and the loss calculation.
At the moment I’ve got it set up so that I have:
src: the input sequence, with an [end_token] appended at the end
tgt: the input sequence, with a [start_token] prepended at the beginning, thus ‘shifting’ it one position to the right (also with an [end_token] at the end)
My model then produces an output of shape [batch, seq_length, vocab_size], and I calculate the loss with:
prediction_loss = criterion(output, tgt)
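In case it helps, here’s a minimal runnable sketch of the setup I’m describing. The token IDs and vocabulary size are made up for illustration, the model is replaced by random logits, and I’m assuming criterion is nn.CrossEntropyLoss (which wants the class dimension second, hence the transpose):

```python
import torch
import torch.nn as nn

# Hypothetical special-token IDs and a toy vocabulary size (made up for illustration).
START_TOKEN, END_TOKEN, VOCAB_SIZE = 1, 2, 10

# A toy "input" sequence of token IDs, batch size 1.
tokens = torch.tensor([[5, 6, 7]])

# src: the input with [end_token] appended at the end.
src = torch.cat([tokens, torch.full((1, 1), END_TOKEN)], dim=1)  # [[5, 6, 7, 2]]

# tgt: the input with [start_token] prepended and [end_token] appended.
tgt = torch.cat(
    [torch.full((1, 1), START_TOKEN), tokens, torch.full((1, 1), END_TOKEN)],
    dim=1,
)  # [[1, 5, 6, 7, 2]]

# Stand-in for the model: random logits of shape [batch, seq_length, vocab_size].
output = torch.randn(1, tgt.size(1), VOCAB_SIZE)

# Loss against tgt; CrossEntropyLoss expects [batch, vocab_size, seq_length] logits,
# so the last two dimensions are transposed.
criterion = nn.CrossEntropyLoss()
prediction_loss = criterion(output.transpose(1, 2), tgt)
```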
However, I’m confused about whether the loss should be calculated against tgt (which includes the [start_token]), against just src, or against something else entirely. In other words, in a ‘perfect’ prediction, should the start/end tokens appear in the output sequence?
As a follow-up question: should the [end_token] be present in both sequences during training?
Many thanks in advance. Sorry if this is obvious; I’ve looked over many tutorials and codebases and am still quite confused on this point.