Auto-regressive inference with transformer?

anteater · August 10, 2023, 10:35am

I am trying to use nn.Transformer to train a basic language model. The model trains, but at inference time I ran into an issue and noticed the following sentence in the docs:

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is same as the input sequence (i.e. target) length of the decoder.

This generally makes sense, but I am confused about how to run inference with this in mind. From tutorials and the docs, I understand that the usual approach is to start with a single SOS token as the decoder input, append the output to that input, and repeat one token at a time until EOS is generated. However, because of that sentence, my output doubles in size at each step: the decoder input, and therefore the output, grows as 1 → 2 → 4 → 8 and so on, because the input and output must be the same length.

My expectation is that we autoregressively generate one word at a time, but I’m not sure what I’m misunderstanding or how to do that given this constraint. Any help would be appreciated.
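
For reference, here is a minimal sketch of the loop I have in mind. It is not my actual code: the sizes, the token ids (SOS, EOS), and the names embed, to_logits and greedy_decode are made up for illustration, the model is untrained, and positional encoding is left out for brevity. The one detail that seems to matter is appending only the last time step of the decoder output rather than the whole output, so the decoder input should grow 1 → 2 → 3 instead of doubling.

```python
import torch
import torch.nn as nn

# Hypothetical sizes and token ids, chosen only for illustration.
VOCAB_SIZE, D_MODEL, SOS, EOS, MAX_LEN = 20, 32, 1, 2, 10

torch.manual_seed(0)
embed = nn.Embedding(VOCAB_SIZE, D_MODEL)      # token ids -> vectors
model = nn.Transformer(d_model=D_MODEL, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64,
                       batch_first=True)
to_logits = nn.Linear(D_MODEL, VOCAB_SIZE)     # vectors -> vocabulary scores


@torch.no_grad()
def greedy_decode(src_ids):
    """Greedy decoding: extend the target by exactly one token per step."""
    model.eval()
    # Encode the source once and reuse it at every decoding step.
    memory = model.encoder(embed(src_ids))
    # Start every sequence in the batch with a single SOS token.
    tgt_ids = torch.full((src_ids.size(0), 1), SOS, dtype=torch.long)
    lengths = [tgt_ids.size(1)]
    for _ in range(MAX_LEN - 1):
        # Causal mask so each position attends only to earlier positions.
        tgt_mask = model.generate_square_subsequent_mask(tgt_ids.size(1))
        # out has the same length as tgt_ids: (batch, tgt_len, D_MODEL).
        out = model.decoder(embed(tgt_ids), memory, tgt_mask=tgt_mask)
        # Keep ONLY the last position; it predicts the next token.
        next_id = to_logits(out[:, -1]).argmax(dim=-1, keepdim=True)
        tgt_ids = torch.cat([tgt_ids, next_id], dim=1)
        lengths.append(tgt_ids.size(1))
        if (next_id == EOS).all():
            break
    return tgt_ids, lengths
```

Calling greedy_decode(torch.randint(3, VOCAB_SIZE, (1, 5))) returns the generated ids together with the decoder input length recorded at each step, which makes it easy to check whether the length grows by one or doubles.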