LSTM/GRU/RNN (dropout) PyTorch implementation following the TensorFlow one

nsacco · January 8, 2021, 5:52pm

Hi guys! I moved from TF to PyTorch a few months ago. While I am enjoying the speed and flexibility, I am struggling to replicate the results of one of my previous TF works in PyTorch. Specifically, I am talking about a seq2seq model (which I am now extending with attention, but let’s forget about this). I’ve fixed the “basic” discrepancy caused by different weight initialization. My major concern is about dropout. As you might know, TF implements two different (variational) dropouts:

dropout - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Default: 0.

recurrent_dropout - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state. Default: 0.

The default PyTorch nn.LSTM implementation is, as far as I’ve understood, completely different. In fact, I am not able to reproduce the results I had in TF using only standard PyTorch dropout (which should not be variational). Using the LSTM implementation at https://github.com/keitakurita/Better_LSTM_PyTorch/blob/master/better_lstm/model.py, as suggested in Variational dropout? - #9 by Iridium_Blue, I get a huge performance improvement, even though I am still struggling to replicate the results that I got in TF.
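To make the difference concrete, here is a minimal sketch of the two masking schemes. As far as I know, the `dropout` argument of nn.LSTM only adds a dropout layer on the outputs of each stacked layer except the last, so it never touches the recurrent connection, and nn.Dropout samples an independent mask for every element. The tensor sizes and the variable names below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(2, 4, 3)  # (batch, seq_len, features), toy sizes

# Standard dropout: every element gets its own Bernoulli draw,
# so the mask generally changes from one timestep to the next.
standard = nn.Dropout(p=0.5)
standard.train()
y_standard = standard(x)

# Variational-style dropout: one mask per sequence, shaped
# (batch, 1, features) and broadcast over the time axis.
keep = 0.5
mask = torch.bernoulli(torch.full((2, 1, 3), keep)) / keep
y_variational = x * mask

# Every timestep of a given sequence sees the same mask.
same_over_time = bool((y_variational == y_variational[:, :1]).all())
```

Here `same_over_time` is true by construction, while `y_standard` carries no such guarantee.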

So, my question: Is anyone aware of an implementation of recurrent layers (so, not just LSTM) that is very close to the TF one?

I think it should not be too difficult to extend the Variational dropout? - #9 by Iridium_Blue implementation to GRU and RNN, but I am still interested in other implementations, if available.
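For reference, a rough sketch of what such an extension to GRU could look like, assuming a manual loop over nn.GRUCell with one input mask and one recurrent mask sampled per sequence. The class name `VariationalDropoutGRU` and the helper `_mask` are hypothetical and not taken from the linked implementation. It is also not an exact match for TF: as far as I understand, Keras masks the hidden state only inside the recurrent linear transformation, whereas here the masked state also enters the update-gate blend inside nn.GRUCell.

```python
import torch
import torch.nn as nn


class VariationalDropoutGRU(nn.Module):
    """Hypothetical single-layer GRU that reuses one dropout mask per sequence."""

    def __init__(self, input_size, hidden_size, dropout=0.0, recurrent_dropout=0.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.dropout = dropout
        self.recurrent_dropout = recurrent_dropout
        self.cell = nn.GRUCell(input_size, hidden_size)

    @staticmethod
    def _mask(ref, batch, features, p):
        # Inverted-dropout mask: kept units are scaled by 1 / keep.
        keep = 1.0 - p
        return ref.new_empty(batch, features).bernoulli_(keep) / keep

    def forward(self, x, h0=None):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size) if h0 is None else h0

        # Masks are sampled once per forward pass and reused at every timestep.
        in_mask = rec_mask = None
        if self.training and self.dropout > 0:
            in_mask = self._mask(x, batch, x.size(-1), self.dropout)
        if self.training and self.recurrent_dropout > 0:
            rec_mask = self._mask(x, batch, self.hidden_size, self.recurrent_dropout)

        outputs = []
        for t in range(seq_len):
            x_t = x[:, t] if in_mask is None else x[:, t] * in_mask
            h_in = h if rec_mask is None else h * rec_mask
            h = self.cell(x_t, h_in)
            outputs.append(h)
        # (batch, seq_len, hidden_size) and the final hidden state
        return torch.stack(outputs, dim=1), h
```

Swapping nn.GRUCell for nn.RNNCell should work with the same loop. The Python-level loop will be noticeably slower than the fused nn.GRU kernel.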

torch_torch · April 21, 2023, 11:31pm

Same issue. After I migrated from TF to torch, the performance suffered a large drop. Basically, dropout plays a role here.

vdw · April 23, 2023, 7:46am

Do you have some code of your model to show? There are typically 2 issues that often crop up yielding subpar results:

The wrong output of the LSTM layer is used for further processing
The data gets scrambled due to careless use of view() or reshape()
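Both issues can be illustrated with a small sketch. It assumes a single-layer, unidirectional nn.LSTM with batch_first=True and unpadded sequences; the sizes and variable names are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, feat, hidden = 2, 5, 3, 4
lstm = nn.LSTM(feat, hidden, batch_first=True)
x = torch.randn(batch, seq_len, feat)

# Issue 1: know which output you are using.
# output: (batch, seq_len, hidden), the hidden state at every timestep
# h_n:    (num_layers * num_directions, batch, hidden), final state only
output, (h_n, c_n) = lstm(x)
last_step = output[:, -1]  # (batch, hidden)
# Holds for this simple setup; not for bidirectional or padded inputs.
outputs_agree = torch.allclose(last_step, h_n[-1])

# Issue 2: view()/reshape() reinterpret memory, they do not swap axes.
seq_first_wrong = x.reshape(seq_len, batch, feat)  # mixes up samples and timesteps
seq_first_right = x.transpose(0, 1)                # (seq_len, batch, feat)
```

Both `seq_first_wrong` and `seq_first_right` have the same shape, so the mistake raises no error; only the values differ.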