The output contains the output of LSTM for all timesteps for all sequences.
The hidden has two components.
The output of last timestep (of the individual sequences).
The cell state after last timestep
In my understanding, you may want to pass the first component of the hidden variable (hidden[0]) to the decoder, as it represents the encoding of the whole sentence after having seen all the words in the sentence. Again, its a design choice about your network.