How does the transformer output one word?

Arthur_Zakirov · June 11, 2021, 12:46pm

Hello everyone,

as far as I understand, the grey cell maintains the shape [batch, sequence, dim] all the way through.

So during inference, how is the last decoder cell supposed to give the probabilities of a single word if the output shape is [batch, sequence, dim] with sequence > 1?

My hypothesis is the following. Is that correct?

iteration 1: dec_inp = [start] → dec_out = [y0]; append the last output element to the input (here: y0)
iteration 2: dec_inp = [start, y0] → dec_out = [y0, y1]; append the last output element to the input (here: y1)
iteration 3: dec_inp = [start, y0, y1] → dec_out = [y0, y1, y2]

So you do not actually predict one word, but the whole sequence that you have ‘so far’ and then append the last element of that sequence to the input of the new iteration.
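
To make the hypothesis concrete, here is a minimal sketch of that loop in plain Python. It is only an illustration: `toy_decoder`, the token ids, and `VOCAB_SIZE` are all made up, and the toy function merely stands in for a real decoder stack plus the final linear layer. The one property it is meant to mimic is that the decoder returns one row of vocabulary scores per input position, and only the last row is used to pick the next token.

```python
from typing import List

# Hypothetical token ids for illustration only.
START, END = 0, 3
VOCAB_SIZE = 4


def toy_decoder(dec_inp: List[int]) -> List[List[float]]:
    """Made-up stand-in for a decoder + output projection.

    Returns one row of vocabulary scores per input position,
    i.e. something shaped like [sequence, vocab].
    This toy always scores "current token id + 1" highest.
    """
    scores = []
    for tok in dec_inp:
        nxt = min(tok + 1, END)
        scores.append([1.0 if v == nxt else 0.0 for v in range(VOCAB_SIZE)])
    return scores


def greedy_decode(max_len: int = 10) -> List[int]:
    dec_inp = [START]
    for _ in range(max_len):
        scores = toy_decoder(dec_inp)   # one row per position so far
        last_row = scores[-1]           # only the last position is used
        next_tok = max(range(VOCAB_SIZE), key=last_row.__getitem__)
        dec_inp.append(next_tok)        # feed the prediction back in
        if next_tok == END:
            break
    return dec_inp


result = greedy_decode()
print(result)
```

In a real model the earlier rows are recomputed on every iteration, but with a causal mask position i only attends to positions up to i, so those rows should not change as the input grows; that is why it seems enough to look at the last row only.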

Thanks in advance

pascal_notsawo · June 11, 2021, 12:51pm

Why not stay in one discussion?

Arthur_Zakirov · June 11, 2021, 12:52pm

It’s a different question, should I move it over to the other post?

pascal_notsawo · June 11, 2021, 12:55pm

Honestly I find it better, especially since the answer was given in the latter.