How does dropout in LSTM/GRU work?

In the docs it is stated that dropout is applied to the output of intermediate layers. My question is, what kind of dropout? Is it the normal Dropout layer, which drops completely at random? Or is it something like Dropout2d, which drops along the feature axis (still not optimal for time-series data)? Unfortunately, the source code does not reveal an answer, as the dropout parameter is passed to some C function, which already returns the results.

Bonus question: does the dropout parameter work correctly with packed sequences as input?

Double bonus question: if I want to have spatial dropout (dropping the same features across the whole time axis) and to implement it myself while working with packed sequences, would it be correct to do the following:
x = pack_padded_sequence(x, …)
x, _ = lstm1(x)  # only one layer
x, _ = pad_packed_sequence(x, …)  # returns (padded, lengths)
x = x.permute(0, 2, 1)
x = Dropout2d()(x)
x = x.permute(0, 2, 1)
x = pack_padded_sequence(x, …)
x, _ = lstm2(x)  # only one layer
x, _ = pad_packed_sequence(x, …)
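
For reference, here is a self-contained version of that sketch. The shapes, lengths, dropout probability, and batch_first=True are assumptions of mine, not part of the question, and the extra unsqueeze/squeeze only makes the input to Dropout2d an unambiguous 4D (N, C, H, W) tensor:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Assumed sizes (not from the question): batch of 4, up to 10 steps, 8 input features, 16 hidden units.
x = torch.randn(4, 10, 8)                      # (batch, time, features)
lengths = torch.tensor([10, 9, 7, 5])          # valid lengths, sorted descending

lstm1 = nn.LSTM(8, 16, batch_first=True)
lstm2 = nn.LSTM(16, 16, batch_first=True)
spatial_dropout = nn.Dropout2d(p=0.2)          # zeroes whole channels

packed = pack_padded_sequence(x, lengths, batch_first=True)
out, _ = lstm1(packed)
out, _ = pad_packed_sequence(out, batch_first=True)    # (batch, time, hidden)

# Make the feature dimension the channel dimension, so Dropout2d drops the
# same features at every time step of a sequence.
out = out.permute(0, 2, 1).unsqueeze(-1)       # (batch, hidden, time, 1)
out = spatial_dropout(out)
out = out.squeeze(-1).permute(0, 2, 1)         # back to (batch, time, hidden)

packed = pack_padded_sequence(out, lengths, batch_first=True)
out, _ = lstm2(packed)
out, _ = pad_packed_sequence(out, batch_first=True)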

Back in the day, the dropout was just applied randomly to each element, without any structure.
This was discussed a bit, including on this forum in the early days, when Gal and Ghahramani ([1512.05287] A Theoretically Grounded Application of Dropout in Recurrent Neural Networks) proposed a dropout scheme whose mask is kept fixed across the “time” dimension. You can have more control by using LSTMCell and doing the time loop yourself, at the expense of some speed, which you may (partially?) recover through pre-compilation.
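
As a rough illustration of that approach, here is a minimal sketch of one layer whose dropout mask is sampled once per sequence and reused at every time step, using LSTMCell and a hand-written time loop. The module name, sizes, and the choice to only mask the layer output (the paper also applies fixed masks to the inputs and recurrent connections) are assumptions of mine:

import torch
import torch.nn as nn

class VariationalDropoutLSTM(nn.Module):
    # One LSTM layer whose output dropout mask is sampled once per sequence
    # and reused at every time step (Gal & Ghahramani style).
    def __init__(self, input_size, hidden_size, dropout=0.25):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.dropout = dropout

    def forward(self, x):                      # x: (batch, time, input_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)

        mask = None
        if self.training and self.dropout > 0:
            keep = 1.0 - self.dropout
            # One Bernoulli mask per sequence, shared over all time steps.
            mask = x.new_empty(batch, self.hidden_size).bernoulli_(keep) / keep

        outputs = []
        for t in range(steps):
            h, c = self.cell(x[:, t], (h, c))
            outputs.append(h * mask if mask is not None else h)
        return torch.stack(outputs, dim=1)     # (batch, time, hidden_size)

layer = VariationalDropoutLSTM(8, 16)
y = layer(torch.randn(4, 10, 8))               # (4, 10, 16)

The explicit Python loop is where the speed goes; scripting it (e.g. with torch.jit.script) is one way to try to recover part of that, as mentioned above.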

Best regards

Thomas
