In the documentation for LSTM, it says:
dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer
I have two questions:
Does it apply dropout at every time step of the LSTM?
If there is only one LSTM layer, will the dropout still be applied?
And it’s very strange that even when I set dropout=1, it seems to have no effect on my network’s performance. Like this:
self.lstm1 = nn.LSTM(input_dim, lstm_size1, dropout=1, batch_first=False)
This is only one layer, so I doubt the dropout actually works.
And yes, your answer has been confirmed by my experiments.
I manually added a dropout layer after the LSTM, and it works well.
I had been stuck on this bug for a long time! With the same data and the same config, the PyTorch version always overfit. Finally! Thanks!
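For anyone else hitting this, here is a minimal sketch of that workaround (the module and dimension names are illustrative, not from the original post): a single-layer nn.LSTM followed by an explicit nn.Dropout on its outputs.

```python
import torch
import torch.nn as nn

class LSTMWithDropout(nn.Module):
    """Single-layer LSTM followed by an explicit dropout layer,
    since nn.LSTM's dropout argument has no effect when num_layers=1."""
    def __init__(self, input_dim, hidden_dim, p=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=False)
        self.drop = nn.Dropout(p)  # applied to the outputs of every time step

    def forward(self, x):
        out, (h, c) = self.lstm(x)   # out: (seq_len, batch, hidden_dim)
        return self.drop(out), (h, c)

model = LSTMWithDropout(8, 16, p=0.5)
x = torch.randn(5, 3, 8)             # (seq_len, batch, input_dim)
out, _ = model(x)
print(out.shape)                     # torch.Size([5, 3, 16])
```

In eval() mode the dropout layer becomes a no-op, so the forward pass is deterministic again.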
I was wondering: if you manually add a dropout layer after the LSTM, will the dropout mask be the same for all the time steps in a sequence, or will it be different for each time step?
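As far as I can tell, nn.Dropout samples an independent mask for every element of its input tensor, so applying it to the full (seq_len, batch, hidden) output gives a different mask at each time step. A quick check (the shapes here are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()                  # dropout is only active in training mode

out = torch.ones(4, 2, 6)     # (seq_len, batch, hidden), all ones
masked = drop(out)            # dropped units become 0; kept units are scaled by 1/(1-p) = 2

# The zero pattern at time step 0 generally differs from time step 1,
# i.e. the mask is resampled per element, not shared across time steps.
print((masked[0] == 0))
print((masked[1] == 0))
```

If you want the same mask at every time step (variational/"locked" dropout), you would have to sample one mask of shape (1, batch, hidden) yourself and broadcast it over the sequence.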
The documentation for LSTM, for the dropout argument, states:
introduces a dropout layer on the outputs of each RNN layer except the last layer
I just want to clarify what is meant by “everything except the last layer”.
Below I have an image of two possible options for the meaning.
Option 1: The final cell (the last time step) is the one whose output does not have dropout applied.
Option 2: In a multi-layer LSTM, all the connections between layers have dropout applied, except the very top layer.
But the figure in this post shows it is not…
Which claim is true?
The RNN or LSTM network recurs at every step, so at each step it behaves like a normal fully-connected network. So the dropout is applied at each time step.
In newer versions of PyTorch, a 1-layer RNN warns that the dropout argument has no effect, so dropout is not applied at any step unless it is implemented manually (e.g., by rewriting the RNN module).
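You can see this behavior directly: recent PyTorch versions emit a UserWarning when dropout is non-zero but num_layers=1 (the exact message wording may differ across versions).

```python
import warnings
import torch.nn as nn

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # dropout is requested, but with a single layer there is
    # no layer-to-layer connection to apply it between
    lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, dropout=0.5)

# PyTorch warns that non-zero dropout expects num_layers > 1
print(any("dropout" in str(w.message) for w in caught))
```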
Yes, I guess your description is clearer.
I was trying to explain that dropout is applied at every time step, i.e., dropout acts on every h_t.
Dropout is placed between two stacked RNN layers.
I have looked at the source code, and I tend to think claim 2 is right: dropout works between layers, not separately at every time step.
for i in range(num_layers):
    all_output = []
    for j, inner in enumerate(inners):
        l = i * num_directions + j
        hy, output = inner(input, hidden[l], weight[l], batch_sizes)
        all_output.append(output)
    input = torch.cat(all_output, input.dim() - 1)
    # dropout is applied to a layer's whole output, and only
    # between stacked layers -- never after the last one
    if dropout != 0 and i < num_layers - 1:
        input = F.dropout(input, p=dropout, training=train, inplace=False)
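Consistent with that source, a quick experiment (layer sizes are arbitrary): with num_layers=2 and dropout, two forward passes in training mode differ because the mask on the layer-1 outputs is resampled each pass, while in eval mode they match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, dropout=0.5)
x = torch.randn(5, 3, 8)              # (seq_len, batch, input_size)

lstm.train()                          # dropout active between layer 1 and layer 2
out_a, _ = lstm(x)
out_b, _ = lstm(x)
print(torch.allclose(out_a, out_b))   # False: masks are resampled per pass

lstm.eval()                           # dropout disabled
out_c, _ = lstm(x)
out_d, _ = lstm(x)
print(torch.allclose(out_c, out_d))   # True: forward pass is deterministic
```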