num_layers is discussed here. I can roughly understand what it does, but the details are still fuzzy to me, because the standard LSTM is only a single layer.

In the notation of http://blog.echen.me/2017/05/30/exploring-lstms/ , y_t = V h_t, so the "output" is a transform of the hidden state in that terminology. However, I think that in a multi-layer LSTM it is actually the hidden state of the lower layer that is fed to the upper layer. Is that so?

Is multi-layer the same as stacking?

Do the two LSTMs have exactly the same parameters, or do they have different ones? If their parameters are different, are their sizes the same? Do you have a minimal working implementation to help resolve the ambiguity?

At each timestep an LSTM unit receives a new input and combines it with its cell ("memory") state AND with its previous output, which is its hidden state. It produces an updated cell state and a new hidden state, and that hidden state is also the unit's output.
This output value is fed into the next layer of your model.
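The per-timestep loop just described can be sketched with a single-layer cell. This is a minimal illustration, assuming PyTorch (whose nn.LSTMCell exposes exactly one step of this recurrence); the batch size and number of timesteps are arbitrary:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)
h = torch.zeros(3, 20)  # hidden state == the unit's output, batch of 3
c = torch.zeros(3, 20)  # cell ("memory") state

for t in range(5):            # 5 timesteps
    x_t = torch.randn(3, 10)  # new input at timestep t
    h, c = cell(x_t, (h, c))  # combine input with previous h and c

# h is what would be fed to the next layer at the final timestep
print(h.shape)  # torch.Size([3, 20])
```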

Suppose you use LSTM(input_size=10, hidden_size=20, num_layers=2). This call packages up two LSTM layers, which also lets some calculations be accelerated. The block expects to receive 10 values per batch element at each timestep. The first layer, with its 20 hidden units, transforms those 10 values in 20 different ways. The output of this first layer is then passed to the second layer, which processes these 20 values in 20 more ways. The output of the second layer is the output of the LSTM block.
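You can verify the shapes this implies, assuming PyTorch's nn.LSTM (which the signature above matches). The sequence length and batch size here are arbitrary:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)

x = torch.randn(7, 3, 10)        # (seq_len=7, batch=3, input_size=10)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([7, 3, 20]) -- the top layer's hidden states
print(h_n.shape)     # torch.Size([2, 3, 20]) -- final hidden state, one per layer
```

Note that `output` only ever has `hidden_size` features, no matter how many layers you stack: it is the hidden-state sequence of the topmost layer, which is exactly the "hidden state fed upward" behaviour you asked about.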

Is that any clearer?

To answer your last question: every single LSTM layer in your model has its own set of weights, and their sizes can differ. The first layer's input weights match the input size (10 here), while a higher layer's input weights match the hidden size of the layer below it (20 here).
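You can see the per-layer parameters directly, again assuming PyTorch's nn.LSTM. Each layer stacks its four gate matrices, so the leading dimension is 4 * hidden_size = 80:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)

for name, p in lstm.named_parameters():
    print(name, tuple(p.shape))
# weight_ih_l0 (80, 10)  -- layer 0 sees the 10 input features
# weight_hh_l0 (80, 20)
# bias_ih_l0   (80,)
# bias_hh_l0   (80,)
# weight_ih_l1 (80, 20)  -- layer 1 sees the 20 values from layer 0
# weight_hh_l1 (80, 20)
# bias_ih_l1   (80,)
# bias_hh_l1   (80,)
```

The `weight_ih_l0` vs `weight_ih_l1` shapes show both answers at once: the two layers have separate parameters, and those parameters are not even the same size.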