I was following a tutorial on transformers in language modelling (Language Modeling with nn.Transformer and TorchText — PyTorch Tutorials 1.9.0+cu102 documentation) and I came across a bunch of questions.
What exactly does the model in this tutorial return? When I feed it a sequence of length N (in one batch), it always returns an N x B x V tensor (where B is the batch dimension and V the vocabulary dimension). When I take the argmax over the V dimension and decode back to words, I get N words. Why N? If the last layer is Linear, shouldn't I get just one word? After all, given a sequence, I care about the words that extend that sequence.
What exactly does each word in the generated sequence mean? Is the first generated word based only on the first input words, and the last one based on the whole input sequence? So if I want a next-word prediction for the input sequence, should I access only the last element along the N dimension?
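To make my reading of questions 1–2 concrete, here is a pure-Python sketch of how I currently interpret the output (plain lists standing in for tensors; the function and variable names are mine, not from the tutorial):

```python
# Hypothetical sketch: output has shape N x B x V, one logit vector per
# input position. My understanding is that position i's logits are trained
# to predict token i+1, so the prediction that extends the *whole* input
# sits at position N-1.

def next_word_ids(output, batch=0):
    """Return the argmax word id at every position, and the last one."""
    n = len(output)
    vocab_argmax = lambda logits: max(range(len(logits)), key=logits.__getitem__)
    # argmax over V at every position gives N words...
    all_words = [vocab_argmax(output[i][batch]) for i in range(n)]
    # ...but only the last one extends the full input sequence.
    return all_words, all_words[-1]

# Toy output: N=3, B=1, V=4
out = [[[0.1, 0.9, 0.0, 0.0]],
       [[0.0, 0.0, 0.8, 0.2]],
       [[0.0, 0.2, 0.1, 0.7]]]
all_words, next_word = next_word_ids(out)
# all_words == [1, 2, 3], next_word == 3
```

Is that the right way to read the N outputs?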
Also, if I'd like to get M words following the N-length sequence, am I supposed to run the model M times, each time feeding it the input sequence with the newly generated words appended?
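In other words, I imagine something like this greedy loop (a sketch with a dummy stand-in model; `greedy_generate` and `dummy_model` are my own names, not the tutorial's):

```python
def greedy_generate(model, tokens, m):
    """Append m greedily chosen tokens, re-running the model each step.
    model(tokens) is assumed to return one logit list per position (N x V)."""
    tokens = list(tokens)
    for _ in range(m):
        logits = model(tokens)
        last = logits[-1]  # only the last position extends the full sequence
        tokens.append(max(range(len(last)), key=last.__getitem__))
    return tokens

# Dummy "model": at every position, predicts (token + 1) mod 5.
def dummy_model(tokens):
    v = 5
    return [[1.0 if j == (t + 1) % v else 0.0 for j in range(v)] for t in tokens]

print(greedy_generate(dummy_model, [0], 3))  # [0, 1, 2, 3]
```

Is re-running the full forward pass like this really the intended way, or is there a cheaper approach?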
The model asks for a mask in the forward function. If I understand it correctly, whether I pass it or skip it, the last word of the generated sequence should come out the same, because when generating the last position the attention mechanism has access to all preceding tokens either way. However, this is not true in all cases. Why?
Also, if I trained the model not on targets shifted right by 1 but, say, by 10, and bptt were also 10, so the attention mechanism had no way to access the following words anyway, do I understand correctly that masking would then be pointless?
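For context, the mask I mean is the square causal mask from the tutorial (generate_square_subsequent_mask). A plain-Python sketch of what I understand it to contain (my own helper name, not torch code):

```python
import math

def causal_mask(n):
    """n x n additive attention mask: 0.0 where position i may attend to
    position j (j <= i), -inf where it may not (j > i)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# The last row is all zeros: the final position attends to everything,
# which is why I expected the last prediction to match with or without a mask.
```

If that picture is right, then with a 10-shifted target and bptt=10 the mask would never hide anything the model could cheat from — is that correct?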
Thanks in advance for answers.