What is the reason behind this restriction?
The documentation for all recurrent layers says:

dropout – If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer
But why? Is it an implementation issue? Or is there research on this topic?
With only 1 LSTM layer this means I cannot use the built-in dropout argument at all, even though dropout (applied manually to the outputs) improves performance on my time series forecasting problem.
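For reference, here is a minimal sketch of the manual workaround I mean: since the dropout argument is a no-op for a single layer, I apply an nn.Dropout module to the LSTM outputs myself. The class and parameter names here are just illustrative:

```python
import torch
import torch.nn as nn

class LSTMWithDropout(nn.Module):
    """Single-layer LSTM; dropout applied manually, since the built-in
    `dropout` argument does nothing when num_layers=1."""

    def __init__(self, input_size, hidden_size, p=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p)   # dropout after the single LSTM layer
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, seq_len, hidden_size)
        out = self.dropout(out)        # manual dropout on the outputs
        return self.head(out[:, -1])   # forecast from the last time step

model = LSTMWithDropout(input_size=8, hidden_size=32)
x = torch.randn(4, 10, 8)              # (batch, seq_len, features)
y = model(x)                           # shape: (4, 1)
```

In eval mode the Dropout module becomes the identity, so this behaves exactly like a plain single-layer LSTM at inference time.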
Thank you very much!