Dropout in LSTM

In the documentation of LSTM, it says:

dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer

I have two questions:

  1. Does it apply dropout at every time step of the LSTM?
  2. If there is only one LSTM layer, will the dropout still be applied?

And it’s very strange that even when I set dropout=1, it seems to have no effect on my network performance. Like this:

self.lstm1 = nn.LSTM(input_dim, lstm_size1, dropout=1, batch_first=False)

This is only 1 layer, so I doubt whether the dropout really works.

  1. Yes, dropout is applied at each time step; however, IIRC, the mask for each time step is different.
  2. If there is only one layer, dropout is not applied, as indicated in the docs (the only layer is also the last layer). See the sketch below.
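
To illustrate both points, a minimal sketch (the layer sizes, sequence length, and dropout probability here are made up for illustration):

    import torch
    import torch.nn as nn

    x = torch.randn(10, 4, 32)  # (seq_len, batch, input_dim)

    # Single layer: the dropout argument has nothing to act on
    # (recent PyTorch versions warn that it expects num_layers > 1).
    lstm_one = nn.LSTM(32, 64, num_layers=1, dropout=0.5)

    # Two stacked layers: dropout is applied to the outputs of layer 1
    # (at every time step) before they are fed into layer 2,
    # but never to the outputs of the last layer.
    lstm_two = nn.LSTM(32, 64, num_layers=2, dropout=0.5)

    out_one, _ = lstm_one(x)
    out_two, _ = lstm_two(x)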

Thank you!

And yes, your answer has been confirmed by my experiments.

I manually added a dropout layer after the LSTM, and it works well.
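
In case it helps anyone else, a minimal sketch of that workaround (the module name, sizes, and dropout probability are placeholders, not my original code):

    import torch
    import torch.nn as nn

    class LSTMWithDropout(nn.Module):
        def __init__(self, input_dim, lstm_size, p=0.5):
            super().__init__()
            # dropout= on a 1-layer LSTM is ignored, so apply it manually
            self.lstm = nn.LSTM(input_dim, lstm_size, batch_first=False)
            self.dropout = nn.Dropout(p)

        def forward(self, x):
            out, (h, c) = self.lstm(x)   # out: (seq_len, batch, lstm_size)
            out = self.dropout(out)      # active only in train() mode
            return out, (h, c)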

I had been stuck on this bug for a long time! With the same data and the same config, it was always overfitting in the PyTorch version. Finally! Thanks!


Hi Young,

I was wondering: if you manually add a dropout layer after the LSTM, will the dropout mask be the same for all the time steps in a sequence? Or will it be different for each time step?

Thanks
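
For what it's worth, a plain nn.Dropout applied to the full (seq_len, batch, hidden) output samples an independent mask for every element, so the mask differs across time steps. A quick way to check (sizes are arbitrary):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(8, 16)
    drop = nn.Dropout(0.5)
    drop.train()  # dropout only does something in training mode

    out, _ = lstm(torch.randn(5, 1, 8))  # (seq_len=5, batch=1, hidden=16)
    masked = drop(out)

    # The zeroed positions at t=0 and t=1 are (almost surely) different,
    # i.e. the mask is not shared across time steps.
    print((masked[0] == 0).squeeze())
    print((masked[1] == 0).squeeze())

If you want one mask shared by all time steps of a sequence (variational-style dropout), you would have to sample that mask once per sequence and apply it yourself.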

But the figure in this post shows that it is not…
Which claim is true?
Thank you!

The RNN or LSTM network recurs over every time step, which means that at each step it behaves like a normal fully-connected network. So the dropout is applied at each time step.

In newer versions of PyTorch, the dropout argument has no effect on a 1-layer RNN, so dropout is not applied at each step unless it is implemented manually (by rewriting the RNN module).
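
One way to write that manual per-step version is to unroll an nn.LSTMCell yourself; in this sketch dropout is applied to the hidden state at every step, before it is both emitted and carried forward (the module name and that placement are my own choices for illustration):

    import torch
    import torch.nn as nn

    class StepDropoutLSTM(nn.Module):
        # Single-layer LSTM unrolled by hand so dropout can be applied at every step.
        def __init__(self, input_dim, hidden_dim, p=0.5):
            super().__init__()
            self.cell = nn.LSTMCell(input_dim, hidden_dim)
            self.dropout = nn.Dropout(p)
            self.hidden_dim = hidden_dim

        def forward(self, x):  # x: (seq_len, batch, input_dim)
            seq_len, batch, _ = x.shape
            h = x.new_zeros(batch, self.hidden_dim)
            c = x.new_zeros(batch, self.hidden_dim)
            outputs = []
            for t in range(seq_len):
                h, c = self.cell(x[t], (h, c))
                h = self.dropout(h)      # fresh mask at every step; also affects the recurrence
                outputs.append(h)
            return torch.stack(outputs), (h, c)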

Yes, I guess your description is clearer.
I was trying to explain that dropout is applied at every time step, which means the dropout acts on every h_t.

Dropout is placed between two stacked RNN units.
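
In other words, a two-layer nn.LSTM with dropout=p puts the dropout in the same place as this explicit version (a rough sketch; the sizes are arbitrary and the weights are of course not shared with the built-in module):

    import torch
    import torch.nn as nn

    layer1 = nn.LSTM(32, 64)
    between = nn.Dropout(0.3)   # sits between the two stacked LSTM layers
    layer2 = nn.LSTM(64, 64)

    x = torch.randn(10, 4, 32)  # (seq_len, batch, input_dim)
    out, _ = layer1(x)
    out = between(out)          # applied to every time step of layer 1's output
    out, _ = layer2(out)        # no dropout after the last layer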


I have looked at the source code, and I tend to think that claim 2 is right: dropout works between layers, but not at every time step.

        for i in range(num_layers):
            all_output = []
            for j, inner in enumerate(inners):
                l = i * num_directions + j

                hy, output = inner(input, hidden[l], weight[l], batch_sizes)
                next_hidden.append(hy)
                all_output.append(output)

            input = torch.cat(all_output, input.dim() - 1)

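            # dropout is applied to layer i's full output (all time steps), and skipped for the last layer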
            if dropout != 0 and i < num_layers - 1:
                input = F.dropout(input, p=dropout, training=train, inplace=False)