Hey all, new here, perhaps I can help.
Thanks to kreitkurita of Carnegie Mellon University, who subclassed LSTM to use variational dropout: Better LSTM with Variational Dropout
Just use this as a drop-in replacement for LSTM. The implementation is an almost faithful implementation of the original paper https://arxiv.org/abs/1512.05287 (see the code comments for minor deviations).
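For intuition: the key idea of variational dropout is to sample one dropout mask per sequence and reuse it at every timestep, rather than resampling per step. A minimal pure-Python sketch of that idea (the function name and toy data are mine, not from the repo):

```python
import random

def variational_dropout(seq, p, seed=None):
    """Apply the SAME dropout mask at every timestep of a sequence
    (the Gal & Ghahramani scheme), with inverted-dropout scaling.

    seq: list of timesteps, each a list of floats.
    p: drop probability; kept units are scaled by 1/(1-p).
    """
    rng = random.Random(seed)
    dim = len(seq[0])
    # Sample ONE mask for the whole sequence, not one per timestep.
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(dim)]
    return [[x * m for x, m in zip(step, mask)] for step in seq]

seq = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = variational_dropout(seq, p=0.25, seed=0)
# Whichever units were dropped at t=0 are also dropped at t=1.
```

The BetterLSTM class applies this same trick to the input, the recurrent weights (via DropConnect), and the output.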
Some tips:
- 0.25 is a good initial choice for dropouti, dropoutw, and dropouto (input, weight, and output dropout, respectively).
- It is probably best to avoid using other dropout techniques alongside this one (embedding, batch, layer, etc.), at least at first, and possibly always; I need to look into this more.
- In their original paper, Gal and Ghahramani note that weight decay takes on new importance with variational dropout; they suggest 0.001 as a default. (This is set on the optimizer. If you're using Adam, I suggest looking into AdamW instead.)
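On the AdamW point: the difference is that AdamW decouples the weight decay from the gradient, applying it directly to the weights instead of folding an L2 term into the adaptive update. A rough single-scalar sketch of one AdamW step (my own illustrative code, with the 0.001 decay default from above):

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-3):
    """One AdamW update for a single scalar weight.

    Decoupled decay: the weight_decay * w term is added to the update
    directly, NOT mixed into grad before the moment estimates.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=t)
```

In PyTorch you would just use torch.optim.AdamW(model.parameters(), weight_decay=1e-3) rather than writing this by hand.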