The difference and use of output and hidden state of an RNN?

As the title states: what is the difference between using the hidden state and the output of the last cell?
I have gone through various tutorials and code that use RNNs (both GRU and LSTM) for tasks like Seq2Seq and text classification.

output, hidden = rnn("...", "...")  # rnn = GRU/LSTM

For Seq2Seq, auto-encoders, etc., the output of the encoder is ignored and the hidden state of the last cell is used as an input to the decoder. The decoder then uses its output to predict each word one by one.
However, for the text classification task, the output of the last cell is used to predict the label after passing it through a feed-forward layer and some activation.

Why is the output of the last cell preferred over the hidden state in the text classification task? Doesn't the hidden state represent the whole sentence? Or are they interchangeable?


Generally in an encoder-decoder setting the decoder is initialised using the final hidden state of the encoder. As you say, the idea is that the hidden state should represent the whole sentence.
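To make the encoder-decoder hand-off concrete, here is a minimal sketch, assuming PyTorch's `nn.GRU` (the thread does not name a framework, so that is my assumption). The sizes and random tensors are made up for illustration, and the whole target sequence is fed to the decoder at once rather than word by word.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
emb_dim, hid_dim, batch = 8, 16, 4
src_len, tgt_len = 5, 7

encoder = nn.GRU(emb_dim, hid_dim)  # default layout: (seq_len, batch, emb_dim)
decoder = nn.GRU(emb_dim, hid_dim)

src = torch.randn(src_len, batch, emb_dim)  # stand-in for embedded source tokens
tgt = torch.randn(tgt_len, batch, emb_dim)  # stand-in for embedded target tokens

# The per-step encoder outputs are discarded; only the final hidden state is kept.
_, enc_hidden = encoder(src)  # (1, batch, hid_dim)

# The decoder is initialised with the encoder's final hidden state
# and produces one output vector per target step.
dec_output, dec_hidden = decoder(tgt, enc_hidden)  # dec_output: (tgt_len, batch, hid_dim)
```

In a real model each row of `dec_output` would go through a projection onto the vocabulary to predict the next word.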

In a labelling setting the output is generally used. For instance, when labelling parts of speech, for each word input, a part of speech is output. That said, when you are classifying the entire sequence, you don’t want a label for each step of input, you only want a single label at the end, so it seems logical to take the last output.
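The two cases can be sketched side by side, again assuming PyTorch's `nn.GRU`; the layer names and sizes below are hypothetical. Per-step labelling applies a linear layer to every row of the output, while sequence classification applies it to the last row only.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
seq_len, batch, emb_dim, hid_dim = 5, 4, 8, 16
num_tags, num_classes = 10, 3

rnn = nn.GRU(emb_dim, hid_dim)
tagger = nn.Linear(hid_dim, num_tags)         # one label per time step
classifier = nn.Linear(hid_dim, num_classes)  # one label per sequence

x = torch.randn(seq_len, batch, emb_dim)  # stand-in for embedded tokens
output, _ = rnn(x)                        # (seq_len, batch, hid_dim)

tag_logits = tagger(output)               # (seq_len, batch, num_tags)
class_logits = classifier(output[-1])     # (batch, num_classes)
```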

You could use the last hidden state to produce the label, and it might work marvellously, but then again, if the model uses the hidden state to store its intermediate representations of the sentence, then forcing the last hidden state to correspond to the label might interfere with the use of the hidden state for the intermediate representations.

An interesting experiment would be to take the hidden state and feed it into a linear layer or two in order to predict the sentence label.
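A rough sketch of that experiment, assuming PyTorch; the class name `HiddenStateClassifier` and all sizes are hypothetical, and this is only one way such a model could be wired up.

```python
import torch
import torch.nn as nn


class HiddenStateClassifier(nn.Module):
    """Hypothetical sentence classifier that reads the final hidden state."""

    def __init__(self, emb_dim, hid_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # "A linear layer or two" on top of the hidden state.
        self.head = nn.Sequential(
            nn.Linear(hid_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, num_classes),
        )

    def forward(self, x):
        _, hidden = self.rnn(x)       # hidden: (num_layers, batch, hid_dim)
        return self.head(hidden[-1])  # top layer's final state -> (batch, num_classes)


model = HiddenStateClassifier(emb_dim=8, hid_dim=16, num_classes=3)
logits = model(torch.randn(4, 5, 8))  # (batch, seq_len, emb_dim) random stand-in input
```

For an `nn.LSTM` the second return value is a tuple `(h_n, c_n)`, so the forward pass would need to unpack it before indexing.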


I was wondering the same, as I have seen examples (for text classification) that use the last output (output[-1]) as well as ones that use the hidden state. I have also tried a Siamese network with contrastive loss as well as MSE on both the output and the hidden state.

Lessons learned: when I built my LSTM classifier I used the output and got accuracy around 98. With the hidden state the results were not as good (I don't remember the scores). The Siamese network was more interesting: MSE with the output gave great results, contrastive loss with the output gave very bad results, and both MSE and contrastive loss with the hidden state gave bad results. So whether the similarity score is computed from the hidden state or from the output makes a huge difference in this scenario. However, in my case it also depends on the loss function.

Hope this gives some more clarity.


Is the following statement true in the context of RNNs/LSTMs? The output is the concatenation of the hidden states from every time step, whereas hidden is simply the final hidden state.
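This can be checked directly. The sketch below assumes PyTorch's `nn.GRU` and `nn.LSTM` with unidirectional layers and unpadded input; the tensor sizes are arbitrary. It compares the last row of the output with the returned hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 4, 8)  # (seq_len, batch, input_size), random stand-in input

# Single-layer GRU: output stacks the hidden state of every time step.
gru = nn.GRU(input_size=8, hidden_size=16)
output, hidden = gru(x)  # output: (5, 4, 16), hidden: (1, 4, 16)
same_gru = torch.allclose(output[-1], hidden[-1])

# LSTM: the second return value is a tuple (h_n, c_n);
# output is built from h_n at each step, not from the cell state c_n.
lstm = nn.LSTM(input_size=8, hidden_size=16)
lstm_output, (h_n, c_n) = lstm(x)
same_lstm = torch.allclose(lstm_output[-1], h_n[-1])

# Two layers: output only covers the top layer,
# while hidden holds the final state of every layer.
gru2 = nn.GRU(input_size=8, hidden_size=16, num_layers=2)
out2, hid2 = gru2(x)  # out2: (5, 4, 16), hid2: (2, 4, 16)
same_top = torch.allclose(out2[-1], hid2[-1])

print(same_gru, same_lstm, same_top)
```

So in this simple setting the last output and the final hidden state of the top layer are the same tensor values. They stop being interchangeable with a bidirectional RNN (the backward half of `output[-1]` has only seen the last token, whereas the backward final state has seen the whole sequence) or with padded batches that are not packed, where `output[-1]` may correspond to padding.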


Should I feed the output or the hidden state into the fully connected layer to perform text classification?


Is there any explanation for this question?