Dropout in LSTM

Nick_Young · September 24, 2017, 2:01pm

The documentation for LSTM says:

dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer

I have two questions:

Does it apply dropout at every time step of the LSTM?
If there is only one LSTM layer, will the dropout still be applied?

Strangely, even when I set dropout=1, it seems to have no effect on my network's performance. For example:

self.lstm1 = nn.LSTM(input_dim, lstm_size1, dropout=1, batch_first=False)

This is only one layer, so I doubt whether the dropout is actually applied.

ngimel · September 25, 2017, 1:20pm

Yes, dropout is applied at each time step; however, iirc, the mask is different for each time step.
If there is only one layer, dropout is not applied, as indicated in the docs (the only layer is the last layer).
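A quick way to check this is to run the same input twice in training mode and compare the outputs. This is only a sketch: the tensor sizes and the helper name `two_training_passes` are made up for illustration. If dropout is really skipped for a single layer, the two passes should match exactly, while a 2-layer LSTM should give different outputs on each pass.

```python
import warnings

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_dim); sizes are arbitrary


def two_training_passes(num_layers):
    """Build an LSTM with dropout=0.5 and run the same input through it twice."""
    with warnings.catch_warnings():
        # Recent PyTorch versions warn when dropout > 0 and num_layers == 1.
        warnings.simplefilter("ignore")
        lstm = nn.LSTM(4, 8, num_layers=num_layers, dropout=0.5)
    lstm.train()  # dropout is only active in training mode
    out_a, _ = lstm(x)
    out_b, _ = lstm(x)
    return out_a, out_b


one_a, one_b = two_training_passes(num_layers=1)
two_a, two_b = two_training_passes(num_layers=2)

# No dropout with one layer, so both passes are identical.
single_layer_identical = torch.equal(one_a, one_b)
# Dropout between the two layers makes the passes differ.
two_layer_identical = torch.equal(two_a, two_b)
print(single_layer_identical, two_layer_identical)
```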

Nick_Young · September 28, 2017, 2:41am

Thank you!

And yes, your answer has been confirmed by my experiments.

I manually added a dropout layer after the LSTM, and it works well.

I have been stuck on this bug for a long time! With the same data and the same config, the PyTorch version was always overfitting. Finally! Thanks!
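For anyone landing here later, a minimal sketch of what "a dropout layer after the LSTM" can look like. The class name `LSTMWithDropout`, the sizes, and p=0.5 are assumptions for illustration, not the original poster's code.

```python
import torch
import torch.nn as nn


class LSTMWithDropout(nn.Module):
    """Single-layer LSTM followed by an explicit dropout on its outputs."""

    def __init__(self, input_dim, hidden_dim, p=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=False)
        self.drop = nn.Dropout(p)

    def forward(self, x):
        out, state = self.lstm(x)  # out: (seq_len, batch, hidden_dim)
        # Dropout touches only the outputs, not the recurrent state.
        return self.drop(out), state


model = LSTMWithDropout(4, 8)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_dim)

model.train()  # dropout active
train_out, _ = model(x)

model.eval()  # dropout disabled, so outputs are deterministic
eval_out_a, _ = model(x)
eval_out_b, _ = model(x)
```

Remember to call model.eval() at test time, otherwise the dropout stays active.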

ShuokaiPan · February 9, 2018, 3:42pm

Hi Young,

I was wondering: if you manually add a dropout layer after the LSTM, will the dropout mask be the same for all time steps in a sequence, or will it be different for each time step?

Thanks
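One way to check this is to pass a tensor of ones through nn.Dropout and look at which entries survive. nn.Dropout samples its mask independently for every element of the tensor it receives, so with an input of shape (seq_len, batch, hidden) each time step should get its own mask. The sizes below are arbitrary and only serve the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()  # dropout is a no-op in eval mode

# Stand-in for LSTM outputs of shape (seq_len, batch, hidden).
ones = torch.ones(5, 2, 16)
dropped = drop(ones)

# Kept entries are rescaled by 1 / (1 - p) = 2, dropped entries are 0.
masks = dropped != 0

# Compare the mask of the first time step with every later one.
same_mask_every_step = all(
    torch.equal(masks[0], masks[t]) for t in range(1, masks.shape[0])
)
print(same_mask_every_step)
```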

cosmozhang1988 · February 23, 2018, 3:49pm

But in this post the figure shows it is not…
Which claim is true?
Thank you!

LinjX · September 23, 2018, 3:43am

The RNN or LSTM network applies the same cell at every step, which means that at every step it behaves like a normal fully connected network. So the dropout is applied at each time step.

cosmozhang1988 · September 23, 2018, 3:48pm

In newer versions of PyTorch, the dropout argument has no effect on a 1-layer RNN (a non-zero value only triggers a warning), so dropout is not applied at each step unless it is implemented manually (by rewriting the RNN module).
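A small sketch to see how this behaves on your installed version. It assumes a reasonably recent PyTorch, where constructing a 1-layer LSTM with a non-zero dropout emits a warning rather than an error; the sizes are arbitrary.

```python
import warnings

import torch.nn as nn

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    lstm = nn.LSTM(4, 8, num_layers=1, dropout=0.5)

# Collect the text of whatever warnings were raised during construction.
messages = [str(w.message) for w in caught]
for message in messages:
    print(message)
```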

LinjX · September 23, 2018, 4:30pm

Yes, I guess your description is clearer.
I was trying to explain that dropout is applied at every time step, meaning that it acts on every h_t.

The dropout is placed between two stacked RNN units.

FreyWang · September 27, 2018, 4:05am

I have looked at the source code, and I tend to think that claim 2 is right: dropout works between layers, not at every time step.

        for i in range(num_layers):
            all_output = []
            for j, inner in enumerate(inners):
                l = i * num_directions + j

                hy, output = inner(input, hidden[l], weight[l], batch_sizes)
                next_hidden.append(hy)
                all_output.append(output)

            input = torch.cat(all_output, input.dim() - 1)

            if dropout != 0 and i < num_layers - 1:
                input = F.dropout(input, p=dropout, training=train, inplace=False)
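The same control flow can be reproduced with public modules. This is a simplified sketch, not the library's implementation: `stacked_lstm_forward` is a hypothetical helper, it ignores bidirectionality and hidden-state bookkeeping, and it stacks separate single-layer nn.LSTM modules by hand. Note that F.dropout receives the whole (seq_len, batch, hidden) output of a layer at once, so the outputs of every time step are dropped, each element with its own mask, but only between layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def stacked_lstm_forward(layers, x, p, training):
    """Run x through single-layer LSTMs, with dropout after all but the last."""
    num_layers = len(layers)
    for i, layer in enumerate(layers):
        x, _ = layer(x)  # x: (seq_len, batch, hidden) for every time step
        if p != 0 and i < num_layers - 1:
            # Same condition as in the quoted source: skip the last layer.
            x = F.dropout(x, p=p, training=training)
    return x


torch.manual_seed(0)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_dim); sizes are arbitrary
one_layer = [nn.LSTM(4, 8)]
two_layers = [nn.LSTM(4, 8), nn.LSTM(8, 8)]

# One layer: the dropout branch is never taken, so repeated passes match.
one_a = stacked_lstm_forward(one_layer, x, p=0.5, training=True)
one_b = stacked_lstm_forward(one_layer, x, p=0.5, training=True)

# Two layers: dropout sits between them, so repeated passes differ.
two_a = stacked_lstm_forward(two_layers, x, p=0.5, training=True)
two_b = stacked_lstm_forward(two_layers, x, p=0.5, training=True)

print(torch.equal(one_a, one_b), torch.equal(two_a, two_b))
```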