I am still wrapping my head around beam search.
I completely understand how it is described here.
What I don’t understand is why we need to keep track of the hidden states, as opposed to only keeping track of the top-k predictions. What am I missing? The pseudocode is below. I’ve also been looking at this example.
```
for each element in the test-set:
    calculate initial k (decoder-encoder step)
    for range(timesteps - 1):
        for each prev k:
            get hidden state
            obtain its best k
            save hidden state
        find new k from k*k possible ones
        # update hypotheses based on newly found k
        for element in k:
            copy hidden state
            change hypotheses if necessary
            append new k to hypotheses
```
Is it because, when I feed the k candidate words as input at the next timestep, I also need to pass each candidate's hidden state along so the decoder can predict the next word for that hypothesis?
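To make my question concrete, here is a minimal sketch of what I understand the loop to be doing. The `decoder_step` function is a made-up stand-in for a real RNN decoder step (the vocabulary size and the recurrence rule are invented for illustration); the point is that each hypothesis carries its own hidden state, since two beams that diverged at step t have different hidden states at step t+1, so the top-k tokens alone are not enough to continue decoding:

```python
import math

VOCAB = 4  # toy vocabulary size (assumption for illustration)

def decoder_step(token, hidden):
    # Hypothetical decoder step: returns log-probs over VOCAB and a new
    # hidden state. A real model would run an RNN cell here.
    new_hidden = (hidden * 31 + token + 1) % 97        # fake recurrence
    scores = [((new_hidden + w * 7) % 11) + 1 for w in range(VOCAB)]
    total = sum(scores)
    log_probs = [math.log(s / total) for s in scores]
    return log_probs, new_hidden

def beam_search(start_token, start_hidden, k=2, steps=3):
    # Each hypothesis is (log_prob, tokens, hidden). The hidden state
    # travels WITH the hypothesis, which is why it must be tracked.
    beams = [(0.0, [start_token], start_hidden)]
    for _ in range(steps):
        candidates = []
        for lp, tokens, hidden in beams:
            log_probs, new_hidden = decoder_step(tokens[-1], hidden)
            for w, wlp in enumerate(log_probs):
                candidates.append((lp + wlp, tokens + [w], new_hidden))
        # keep the best k of the k * VOCAB expansions
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:k]
    return beams

for lp, tokens, _ in beam_search(start_token=0, start_hidden=1):
    print(tokens, round(lp, 3))
```

In this sketch, dropping the per-beam `hidden` and re-using a single shared state would score every continuation against the wrong context, which is what I suspect the answer to my question is.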