# Variational dropout?

Hi, what is the standard-ish way to do variational dropout in PyTorch?

(Edit: I just need something that works and that I can plug in; I don’t need to understand how it works, just how to use it.)

(Edit 2: though one or two sentences of intuition behind how it works / what it is doing would be very welcome.)


These implementations seem pretty similar and straightforward:

Regarding your second edit: I haven’t even tried understanding it, since the paper is still a part of my (always growing) reading list.


Thank you @justusschock! It looks like these are both dropout, though. The first looks like ZoneOut? And the second looks like standard dropout? Am I misreading? (I’m looking specifically for ‘variational’ dropout.)

In the first repo there is a Jupyter notebook containing several variants of dropout, including this:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable


class VariationalDropout(nn.Module):
    def __init__(self, alpha=1.0, dim=None):
        super(VariationalDropout, self).__init__()

        self.dim = dim
        self.max_alpha = alpha
        # Initial alpha
        log_alpha = (torch.ones(dim) * alpha).log()
        self.log_alpha = nn.Parameter(log_alpha)

    def kl(self):
        # Cubic-polynomial approximation of the negative KL divergence
        # against the log-uniform prior (Kingma et al., 2015)
        c1 = 1.16145124
        c2 = -1.50204118
        c3 = 0.58629921

        alpha = self.log_alpha.exp()

        negative_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha ** 2 + c3 * alpha ** 3

        kl = -negative_kl

        return kl.mean()

    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        if self.training:
            # N(0, 1)
            epsilon = Variable(torch.randn(x.size()))
            if x.is_cuda:
                epsilon = epsilon.cuda()

            # Clip alpha
            self.log_alpha.data = torch.clamp(self.log_alpha.data, max=self.max_alpha)
            alpha = self.log_alpha.exp()

            # N(1, alpha): unit-mean Gaussian noise with variance alpha
            epsilon = epsilon * alpha.sqrt() + 1

            return x * epsilon
        else:
            return x
```
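
If it’s mainly the plug-in usage you’re after, a minimal sketch of how that module could be dropped into a small classifier might look like this (my own example, not from the notebook; the layer sizes and the weight on the KL term are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# assumes the VariationalDropout class from the notebook above is in scope


class SmallNet(nn.Module):
    def __init__(self):
        super(SmallNet, self).__init__()
        self.fc1 = nn.Linear(784, 500)
        self.vd1 = VariationalDropout(alpha=0.5, dim=500)  # multiplicative noise on fc1's activations
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = self.vd1(h)  # noisy in train mode, identity in eval mode
        return self.fc2(h)


model = SmallNet()
x = torch.randn(32, 784)
target = torch.randint(0, 10, (32,))

logits = model(x)
# the KL term from the dropout layer is added to the task loss;
# the 1e-3 weighting is just a placeholder knob
loss = F.cross_entropy(logits, target) + 1e-3 * model.vd1.kl()
loss.backward()
```

Note that the layer is a no-op in eval mode, so you have to switch with model.train() / model.eval() as usual.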

And the second repo contains this implementation:

```python
import math

import torch as t
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import Parameter


class VariationalDropout(nn.Module):
    def __init__(self, input_size, out_size, log_sigma2=-10, threshold=3):
        """
        :param input_size: An int of input size
        :param log_sigma2: Initial value of log sigma ^ 2.
               It is crucial for training since it determines the initial value of alpha
        :param threshold: Value for thresholding at validation. If log_alpha > threshold, the weight is zeroed
        :param out_size: An int of output size
        """
        super(VariationalDropout, self).__init__()

        self.input_size = input_size
        self.out_size = out_size

        self.theta = Parameter(t.FloatTensor(input_size, out_size))
        self.bias = Parameter(t.Tensor(out_size))

        self.log_sigma2 = Parameter(t.FloatTensor(input_size, out_size).fill_(log_sigma2))

        self.reset_parameters()

        self.k = [0.63576, 1.87320, 1.48695]

        self.threshold = threshold

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.out_size)

        self.theta.data.uniform_(-stdv, stdv)
        self.bias.data.uniform_(-stdv, stdv)

    @staticmethod
    def clip(input, to=8):
        input = input.masked_fill(input < -to, -to)
        input = input.masked_fill(input > to, to)

        return input

    def kld(self, log_alpha):
        # Sigmoid-based approximation of the negative KL divergence
        # against the log-uniform prior (Molchanov et al., 2017)
        first_term = self.k[0] * t.sigmoid(self.k[1] + self.k[2] * log_alpha)
        second_term = 0.5 * t.log(1 + t.exp(-log_alpha))

        return -(first_term - second_term - self.k[0]).sum() / (self.input_size * self.out_size)

    def forward(self, input):
        """
        :param input: A float tensor with shape of [batch_size, input_size]
        :return: A float tensor with shape of [batch_size, out_size], plus the layer's kld estimation in training mode
        """
        log_alpha = self.clip(self.log_sigma2 - t.log(self.theta ** 2))
        kld = self.kld(log_alpha)

        if not self.training:
            # Weights whose log_alpha exceeds the threshold are dropped (zeroed),
            # and the layer reduces to a deterministic linear layer
            mask = log_alpha > self.threshold
            return t.mm(input, self.theta.masked_fill(mask, 0)) + self.bias

        # Local reparameterization trick: sample the pre-activations from
        # N(mu, std^2) instead of sampling the weights themselves
        mu = t.mm(input, self.theta)
        std = t.sqrt(t.mm(input ** 2, self.log_sigma2.exp()) + 1e-6)

        eps = Variable(t.randn(*mu.size()))
        if input.is_cuda:
            eps = eps.cuda()

        return std * eps + mu + self.bias, kld

    def max_alpha(self):
        log_alpha = self.log_sigma2 - t.log(self.theta ** 2)
        return t.max(log_alpha.exp())
```

From scrolling through the paper and skimming the equations, it looks like they are both fine, although I might also have misread something.
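
For a bit of the intuition asked for above, and as far as I can tell: alpha is the variance of the multiplicative Gaussian noise on each unit/weight, and the kl / kld methods are closed-form approximations of the KL term that gets added to the training loss (the first follows Kingma et al. 2015, the second Molchanov et al. 2017):

```latex
% First snippet (Kingma et al., 2015), up to an additive constant:
-D_{\mathrm{KL}} \approx 0.5\,\log\alpha + c_1\alpha + c_2\alpha^2 + c_3\alpha^3,
\qquad c_1 = 1.16145124,\; c_2 = -1.50204118,\; c_3 = 0.58629921

% Second snippet (Molchanov et al., 2017):
-D_{\mathrm{KL}} \approx k_1\,\sigma(k_2 + k_3\log\alpha) - 0.5\,\log\!\left(1 + \alpha^{-1}\right) - k_1,
\qquad k_1 = 0.63576,\; k_2 = 1.87320,\; k_3 = 1.48695
```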


Ah, awesome, thanks! Question: how do we use them? Like:

• where do we put them in an RNN?
• are the masks shared between e.g. different timesteps, or something like that?
• will this work with e.g. an nn.LSTM? Or do we need to use an nn.LSTMCell and plug those together, with this dropout in between?
• if so, is there an LSTM implementation that handles this for us?

To be honest:

I don’t know if this works with an LSTM, but my first guess would be no, since both of them seem to be designed for plain feed-forward networks:

```python
import torch.nn as nn
import torch.nn.functional as F

from variational_dropout.variational_dropout import VariationalDropout


class VariationalDropoutModel(nn.Module):
    def __init__(self):
        super(VariationalDropoutModel, self).__init__()

        self.fc = nn.ModuleList([
            VariationalDropout(784, 500),
            VariationalDropout(500, 50),
            nn.Linear(50, 10)
        ])

    def forward(self, input, train=False):
        """
        :param input: A float tensor with shape of [batch_size, 784]
        :param train: A boolean indicating whether forward propagation is called during training
        :return: A float tensor with shape of [batch_size, 10]
                 filled with logits of the likelihood, plus the kld estimation when training
        """

        result = input

        if train:
            total_kld = 0

            for i, layer in enumerate(self.fc):
                if i != len(self.fc) - 1:
                    # each VariationalDropout layer returns its own kld in training mode
                    # (the layers switch on self.training, so keep model.train()/eval() in sync with `train`)
                    result, kld = layer(result)
                    result = F.elu(result)
                    total_kld += kld

            return self.fc[-1](result), total_kld

        for i, layer in enumerate(self.fc):
            if i != len(self.fc) - 1:
                result = F.elu(layer(result))

        return self.fc[-1](result)

    def loss(self, **kwargs):
        if kwargs['train']:
            out, kld = self(kwargs['input'], kwargs['train'])
            return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average']), kld

        out = self(kwargs['input'], kwargs['train'])
        return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average'])
```

So I guess you would have to plug those together yourself.
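
If you do want to wire something up for an RNN yourself, here is roughly the shape it could take: a dropout mask sampled once per sequence and reused at every timestep around an nn.LSTMCell, in the spirit of Gal & Ghahramani’s variational RNN dropout. This is only my own sketch (sizes and rates are made up), not code from either repo, and it covers the input/hidden masks but not dropout on the recurrent weights:

```python
import torch
import torch.nn as nn


class VariationalLSTM(nn.Module):
    """Single-layer LSTM that reuses the same dropout masks at every timestep."""

    def __init__(self, input_size, hidden_size, dropout=0.25):
        super(VariationalLSTM, self).__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.dropout = dropout

    def forward(self, x):
        # x: [seq_len, batch, input_size]
        seq_len, batch, _ = x.size()
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)

        if self.training and self.dropout > 0:
            # one Bernoulli mask per sequence, shared across all timesteps
            keep = 1 - self.dropout
            input_mask = x.new_empty(batch, x.size(2)).bernoulli_(keep) / keep
            hidden_mask = x.new_empty(batch, self.hidden_size).bernoulli_(keep) / keep
        else:
            input_mask = hidden_mask = None

        outputs = []
        for step in range(seq_len):
            x_t = x[step]
            if input_mask is not None:
                x_t = x_t * input_mask      # same input mask at every timestep
            if hidden_mask is not None:
                h = h * hidden_mask         # same recurrent (hidden-to-hidden) mask at every timestep
            h, c = self.cell(x_t, (h, c))
            outputs.append(h)

        return torch.stack(outputs), (h, c)


# usage
lstm = VariationalLSTM(input_size=100, hidden_size=256, dropout=0.25)
out, (h, c) = lstm(torch.randn(35, 8, 100))
```

The important bit, as far as I understand it, is that the masks are sampled once per sequence and reused at every timestep, rather than resampled at each step like ordinary dropout.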

Ya, that was sort of my conclusion too. Unless there is some paper implementation somewhere that is in PyTorch and uses variational dropout?

Hello guys
Any new updates on this a year later?

Hey all, new here, perhaps I can help.

Thanks to kreitkurita of Carnegie Mellon University, who sub-classed LSTM to use variational dropout: Better LSTM with Variational Dropout

Just use this as a drop-in replacement for nn.LSTM. It is an almost faithful implementation of the original paper, https://arxiv.org/abs/1512.05287 (see the code comments for minor deviations).

Some tips :

• 0.25 is a good initial choice for dropouti, dropoutw and dropouto (input, weight, and output dropout, respectively).

• It is probably best to avoid using other dropout techniques alongside this one (embedding, batch, layer, etc.), at least at first, and possibly always. I need to look into this more.

• In their original paper, Gal and Ghahramani note that weight_decay takes on new importance. They suggest 0.001 as a default. (This is set on the Optimizer. If you’re using Adam, I suggest looking into AdamW instead.)
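
To make the “drop-in” part concrete, usage looks roughly like this (the class comes from the linked repo; the sizes and learning rate below are just placeholders of mine):

```python
import torch

# the subclassed LSTM from the linked repo; adjust the import to however you
# installed or copied it, e.g.:
# from better_lstm import LSTM

lstm = LSTM(input_size=300, hidden_size=512,
            dropouti=0.25, dropoutw=0.25, dropouto=0.25)

x = torch.randn(35, 8, 300)   # [seq_len, batch, features], same layout as nn.LSTM's default
out, (h, c) = lstm(x)         # called exactly like nn.LSTM

# weight decay matters more with this kind of dropout (see the tip above);
# 1e-3 is the suggested default, the learning rate is just a placeholder
optimizer = torch.optim.AdamW(lstm.parameters(), lr=1e-3, weight_decay=1e-3)
```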
