How does the transformer output one word?

Hello everyone,

as far as I understand, the grey cell maintains the shape [batch, sequence, dim] all the way through.

So during inference, how is the last decoder cell supposed to give the probabilities of a single word if the output shape is [batch, sequence, dim] with sequence > 1?

My hypothesis is the following; is it correct?

iteration 1: dec_inp = [start] → dec_out = [y0]; append the last output element (y0) to the input
iteration 2: dec_inp = [start, y0] → dec_out = [y0, y1]; append the last output element (y1) to the input
iteration 3: dec_inp = [start, y0, y1] → dec_out = [y0, y1, y2]

So you do not actually predict just one word, but the whole sequence you have ‘so far’, and then append the last element of that output to the input for the next iteration.
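The loop above can be sketched as follows. This is a minimal toy, not a real transformer: `decoder` is a hypothetical stand-in that just returns causal per-position logits of shape [batch, sequence, vocab], so the shapes match the question. The key line is the one that keeps only the last position of `dec_out` before appending.

```python
import numpy as np

VOCAB = 5
START = 0
rng = np.random.default_rng(0)
E = rng.normal(size=(VOCAB, VOCAB))  # toy "decoder" weights (assumption, not real attention)

def decoder(dec_inp):
    """Toy stand-in for the decoder stack: [batch, seq] ids -> [batch, seq, vocab] logits.
    Causal: position t's logits depend only on tokens 0..t, like a masked decoder."""
    batch, seq = dec_inp.shape
    out = np.zeros((batch, seq, VOCAB))
    for t in range(seq):
        out[:, t] = E[dec_inp[:, : t + 1]].mean(axis=1)
    return out

def generate(max_len=3):
    dec_inp = np.array([[START]])            # iteration 1: dec_inp = [start]
    for _ in range(max_len):
        dec_out = decoder(dec_inp)           # shape [1, len(dec_inp), VOCAB]
        next_id = dec_out[:, -1].argmax(-1)  # keep ONLY the last position's prediction
        dec_inp = np.concatenate([dec_inp, next_id[:, None]], axis=1)
    return dec_inp                           # [start, y0, y1, ...]

print(generate())
```

Even though `dec_out` covers the whole sequence, the earlier positions are simply re-predictions of tokens already in the input; only the last position carries new information, which is why real implementations cache the earlier computation (KV caching) instead of recomputing it.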

Thanks in advance

Why not stay in one discussion?

It’s a different question, should I move it over to the other post?

Honestly, I think that would be better, especially since the answer was given in the latter.