I am trying to build a seq2seq attention-based model. In the literature, people often use stacked (deep) LSTM layers, where each layer produces its own hidden state that is fed into the layer above it. So if I use N layers, I end up with N hidden states. What is the common approach to working with these?
I could sum all N into a single hidden state, concatenate them, or just use the hidden state from the topmost layer. I know all of these are potentially useful approaches, but which one is the most common way of handling this?
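
To make the three options concrete, here is a minimal PyTorch sketch (the sizes and variable names are just for illustration, not from any particular paper):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration
batch, seq_len, input_size, hidden_size, num_layers = 4, 10, 32, 64, 3

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

# h_n has shape (num_layers, batch, hidden_size): one hidden state per layer
output, (h_n, c_n) = lstm(x)

# Option 1: sum the per-layer hidden states -> (batch, hidden_size)
h_sum = h_n.sum(dim=0)

# Option 2: concatenate them -> (batch, num_layers * hidden_size)
h_cat = h_n.permute(1, 0, 2).reshape(batch, -1)

# Option 3: keep only the top layer's hidden state -> (batch, hidden_size)
h_top = h_n[-1]
```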