The PyTorch docs for nn.LSTM say that the dropout argument “introduces a dropout layer on the outputs of each RNN layer except the last layer”.

I just want to clarify what is meant by “except the last layer”.
Below are two possible interpretations.
Option 1: Dropout is applied to the output at every time step except the final one, i.e. the final cell is the only one whose output has no dropout.

Option 2: In a multi-layer LSTM, dropout is applied to all the connections between layers, but not to the output of the very top layer. So a single-layer LSTM would have no dropout applied at all.
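A minimal sketch of Option 2 (the sizes here are made up for illustration): with two stacked layers, dropout sits on layer 1's outputs only, and with a single layer PyTorch warns that the setting has no effect.

```python
import torch
import torch.nn as nn

# Two stacked layers: dropout is applied to layer 1's outputs before
# they are fed into layer 2; layer 2's outputs (what you get back in
# `out`) never see dropout.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, dropout=0.5)

x = torch.randn(7, 3, 10)  # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)

# With a single layer there is no between-layer connection to drop,
# so PyTorch emits a UserWarning and the dropout does nothing:
single = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, dropout=0.5)
```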
It helps to think of nn.LSTM as a rolled-up LSTMCell: the cell fires once for each element of the input sequence, and num_layers is the number of stacked cells that run at each of those time steps (not the number of time steps).

By time steps I'm referring to going through each element in the input. Like if you fed a sentence as the sequence, each word is one time step, and the order of the words in that sentence matters. Hence why the PyTorch docs say:
“For each element in the input sequence, each layer computes the following function”
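Here is a rough sketch of that “rolled up” relationship for a single layer (sizes again made up): a plain loop over the sequence with one LSTMCell is essentially what nn.LSTM with num_layers=1 does for you.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 10, 20
x = torch.randn(seq_len, batch, input_size)

cell = nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):        # one iteration per element = one time step
    h, c = cell(x[t], (h, c))
    outputs.append(h)
out = torch.stack(outputs)      # (seq_len, batch, hidden_size)
```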
So nn.LSTM is the rolled-up LSTM, and the dropout behaviour described in the docs corresponds to Option 2.
Sorry Tom, I misread you and thought you were disagreeing with that, but you were right: the whole multi-layer stack runs at each time step, so it is indeed Option 2 😲
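To make that concrete, here's a hedged sketch of what a two-layer nn.LSTM with dropout is doing; this is a simplification of the real implementation, with made-up sizes. Both layers run at every time step, and dropout only touches the connection between them.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 10, 20
x = torch.randn(seq_len, batch, input_size)

layer1 = nn.LSTMCell(input_size, hidden_size)
layer2 = nn.LSTMCell(hidden_size, hidden_size)
drop = nn.Dropout(p=0.5)  # sits between layer 1 and layer 2 only

h1, c1 = torch.zeros(batch, hidden_size), torch.zeros(batch, hidden_size)
h2, c2 = torch.zeros(batch, hidden_size), torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):                 # the full stack runs every time step
    h1, c1 = layer1(x[t], (h1, c1))
    h2, c2 = layer2(drop(h1), (h2, c2))  # dropout on layer 1's output
    outputs.append(h2)                   # top layer's output: no dropout
out = torch.stack(outputs)
```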