The PyTorch docs for nn.LSTM say that the dropout argument “introduces a dropout layer on the outputs of each RNN layer except the last layer”.

I just want to clarify what is meant by “except the last layer”.
Below are two possible interpretations.
Option 1: Dropout is applied to the output at every time step except the final one, i.e. the final cell is the only one whose output has no dropout.

Option 2: In a multi-layer LSTM, dropout is applied to all the connections between layers, but not to the output of the very top layer. So a single-layer LSTM would have no dropout applied at all.
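A minimal sketch of Option 2 (the sizes here are made up for illustration): with two stacked layers, dropout sits on layer 1's outputs only, and with a single layer PyTorch warns that the setting has no effect.

```python
import torch
import torch.nn as nn

# Two stacked layers: dropout is applied to layer 1's outputs before
# they are fed into layer 2; layer 2's outputs (what you get back in
# `out`) never see dropout.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, dropout=0.5)

x = torch.randn(7, 3, 10)  # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)

# With a single layer there is no between-layer connection to drop,
# so PyTorch emits a UserWarning and the dropout does nothing:
single = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, dropout=0.5)
```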
It helps to think of nn.LSTM as a rolled-up LSTMCell: the cell fires once for each element of the input sequence, and num_layers is the number of stacked cells that run at each of those time steps (not the number of time steps).

By time steps I'm referring to going through each element in the input. Like if you fed a sentence as the sequence, each word is one time step, and the order of the words in that sentence matters. Hence why the PyTorch docs say:
“For each element in the input sequence, each layer computes the following function”
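Here is a rough sketch of that “rolled up” relationship for a single layer (sizes again made up): a plain loop over the sequence with one LSTMCell is essentially what nn.LSTM with num_layers=1 does for you.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 10, 20
x = torch.randn(seq_len, batch, input_size)

cell = nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):        # one iteration per element = one time step
    h, c = cell(x[t], (h, c))
    outputs.append(h)
out = torch.stack(outputs)      # (seq_len, batch, hidden_size)
```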
So nn.LSTM is the rolled-up LSTM, and the dropout behaviour described in the docs corresponds to Option 2.
Sorry Tom, I misread you and thought you were disagreeing with that, but you were right: the whole multi-layer stack runs at each time step, so it is indeed Option 2 😲
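To make that concrete, here's a hedged sketch of what a two-layer nn.LSTM with dropout is doing; this is a simplification of the real implementation, with made-up sizes. Both layers run at every time step, and dropout only touches the connection between them.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 10, 20
x = torch.randn(seq_len, batch, input_size)

layer1 = nn.LSTMCell(input_size, hidden_size)
layer2 = nn.LSTMCell(hidden_size, hidden_size)
drop = nn.Dropout(p=0.5)  # sits between layer 1 and layer 2 only

h1, c1 = torch.zeros(batch, hidden_size), torch.zeros(batch, hidden_size)
h2, c2 = torch.zeros(batch, hidden_size), torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):                 # the full stack runs every time step
    h1, c1 = layer1(x[t], (h1, c1))
    h2, c2 = layer2(drop(h1), (h2, c2))  # dropout on layer 1's output
    outputs.append(h2)                   # top layer's output: no dropout
out = torch.stack(outputs)
```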