Here are the confusing parts.
- In the notation of http://blog.echen.me/2017/05/30/exploring-lstms/ , y_t = V h_t, so the "output" is a transform of the hidden state in that terminology. However, I think that in a multi-layer LSTM it is actually the hidden state from the lower layer that is fed to the upper layer. Is that so?
- Is multi-layer the same as stacking?
- Do the two LSTMs share exactly the same parameters, or do they have different parameters? If their parameters are different, are their sizes the same? Do you have a minimal working implementation to help resolve the ambiguity?
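To make the questions concrete, here is a minimal NumPy sketch of what I understand "stacked" (multi-layer) to mean: each layer has its own separate weight matrices (not shared), the layers may even have different hidden sizes, and the lower layer's hidden state h1 is the upper layer's input at each time step. This is my own illustration, not code from the linked post, so the parameter shapes and gate ordering here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) bias; gates stacked in the order i, f, o, g."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H1, H2 = 3, 5, 4  # input size, layer-1 hidden size, layer-2 hidden size

def init_params(in_dim, hid_dim):
    return (rng.standard_normal((4 * hid_dim, in_dim)) * 0.1,
            rng.standard_normal((4 * hid_dim, hid_dim)) * 0.1,
            np.zeros(4 * hid_dim))

params1 = init_params(D, H1)   # layer 1 has its own weights...
params2 = init_params(H1, H2)  # ...layer 2 has different ones; note its
                               # input size is H1, layer 1's hidden size

h1, c1 = np.zeros(H1), np.zeros(H1)
h2, c2 = np.zeros(H2), np.zeros(H2)

for t in range(6):  # run a short random input sequence
    x_t = rng.standard_normal(D)
    h1, c1 = lstm_step(x_t, h1, c1, *params1)
    # the LOWER layer's hidden state h1 is the UPPER layer's input:
    h2, c2 = lstm_step(h1, h2, c2, *params2)

print(h2.shape)  # top layer's hidden state; a final y_t = V h2 readout
                 # could be applied on top of this
```

In this reading, "multi-layer" and "stacking" are the same thing, and the V in y_t = V h_t would only be applied to the top layer's hidden state, not between layers.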