I am still wrapping my head around beam search.
I completely understand how it is described here.
What I don’t understand is why we need to keep track of the hidden states, as opposed to only keeping track of the top-k predictions. What am I missing? The pseudocode is below. I’ve also been looking at this example.
```
for each element in the test-set:
    calculate initial k (decoder-encoder step)
    for range(timesteps - 1):
        for each prev k:
            get hidden state
            obtain its best k
            save hidden state
        find new k from k*k possible ones
        # update hypotheses based on newly found k
        for element in k:
            copy hidden state
            change hypotheses if necessary
            append new k to hypotheses
```
Is it because, when I feed the k candidate words as input at the next timestep, I also need to pass each candidate's hidden state along so the decoder can predict the next word for that hypothesis?
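To make my question concrete, here is a minimal sketch of what I understand the loop to be doing. The `decoder_step` function is a made-up stand-in for a real RNN decoder step (the vocabulary size and the recurrence rule are invented for illustration); the point is that each hypothesis carries its own hidden state, since two beams that diverged at step t have different hidden states at step t+1, so the top-k tokens alone are not enough to continue decoding:

```python
import math

VOCAB = 4  # toy vocabulary size (assumption for illustration)

def decoder_step(token, hidden):
    # Hypothetical decoder step: returns log-probs over VOCAB and a new
    # hidden state. A real model would run an RNN cell here.
    new_hidden = (hidden * 31 + token + 1) % 97        # fake recurrence
    scores = [((new_hidden + w * 7) % 11) + 1 for w in range(VOCAB)]
    total = sum(scores)
    log_probs = [math.log(s / total) for s in scores]
    return log_probs, new_hidden

def beam_search(start_token, start_hidden, k=2, steps=3):
    # Each hypothesis is (log_prob, tokens, hidden). The hidden state
    # travels WITH the hypothesis, which is why it must be tracked.
    beams = [(0.0, [start_token], start_hidden)]
    for _ in range(steps):
        candidates = []
        for lp, tokens, hidden in beams:
            log_probs, new_hidden = decoder_step(tokens[-1], hidden)
            for w, wlp in enumerate(log_probs):
                candidates.append((lp + wlp, tokens + [w], new_hidden))
        # keep the best k of the k * VOCAB expansions
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:k]
    return beams

for lp, tokens, _ in beam_search(start_token=0, start_hidden=1):
    print(tokens, round(lp, 3))
```

In this sketch, dropping the per-beam `hidden` and re-using a single shared state would score every continuation against the wrong context, which is what I suspect the answer to my question is.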