Create a Bidirectional Stacked LSTM with Deep Hidden-to-Hidden Transition

I’d like to build in PyTorch a bidirectional stacked LSTM (stacked DT-RNN) with fully connected layers between the hidden states, as suggested in Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to Construct Deep Recurrent Neural Networks. arXiv preprint arXiv:1312.6026, pp. 1–10, 2013.
PyTorch’s nn.LSTM doesn’t support this feature: you don’t have control over each time step, so you can’t add a fully connected network that takes the hidden state at time t as input and produces the hidden state passed to time t+1. So I suppose that using nn.LSTMCell is the only option. At this point several issues arise:

  1. nn.LSTMCell is not bidirectional and works on a single time step. So if I’d like to build a 2-layer LSTM, I need 2 LSTM cells and 4 for loops in the forward method to iterate over the sequences (one for the original sequence and one for the reversed one, for each layer). I am forced to run these operations sequentially even though they are highly parallelizable, so it would be very slow.
  2. The input sequences I use have different lengths, so I have padded them. nn.LSTMCell doesn’t support packed sequences (PackedSequence), because it works on a single time step. I’d like to work with batches of samples to get reasonable run times. I have tried taking the tensor produced by torch.nn.utils.rnn.pack_padded_sequence(myBatch) and feeding it into a single LSTM cell, grouping it into smaller batches (let us call them “mini_batches”) whose size varies with the lengths of the sequences. The issue in this case is that the mini_batches have different sizes, and consequently so does the hidden state fed into the hidden-to-hidden transition.
    Do you have any suggestions to solve these issues?
    Thank you in advance
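
For reference, one possible workaround for issue 2 is to keep the full padded batch at every time step and use a per-step mask instead of packed sequences, so the hidden state always has the same batch size. The sketch below is only an illustration under stated assumptions, not the method from the paper: the class names `DeepTransitionLSTM` and `BiStackedDTLSTM` are hypothetical, the transition network (a single Linear + Tanh applied to the previous hidden state before it enters the cell) is just one reading of "deep transition", and the input is assumed to be batch-first and padded on the right.

```python
import torch
import torch.nn as nn


class DeepTransitionLSTM(nn.Module):
    """One direction of one layer (hypothetical name): an nn.LSTMCell whose
    previous hidden state passes through a small network before each step."""

    def __init__(self, input_size, hidden_size, reverse=False):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # Assumed hidden-to-hidden transition: one Linear + Tanh layer.
        self.transition = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.Tanh()
        )
        self.reverse = reverse

    def forward(self, x, lengths):
        # x: (batch, max_len, input_size), right-padded; lengths: (batch,)
        batch, max_len, _ = x.shape
        lengths = lengths.to(x.device)
        h = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        outputs = [None] * max_len
        steps = range(max_len - 1, -1, -1) if self.reverse else range(max_len)
        for t in steps:
            # 1.0 where step t is a real element, 0.0 where it is padding.
            mask = (t < lengths).to(x.dtype).unsqueeze(1)
            h_new, c_new = self.cell(x[:, t], (self.transition(h), c))
            # Padded positions keep their previous state, so the batch size
            # is the same at every step, in both directions.
            h = mask * h_new + (1 - mask) * h
            c = mask * c_new + (1 - mask) * c
            outputs[t] = h_new * mask  # zero output at padded positions
        return torch.stack(outputs, dim=1)


class BiStackedDTLSTM(nn.Module):
    """Stack of layers (hypothetical name), each with a forward and a
    backward DeepTransitionLSTM whose outputs are concatenated."""

    def __init__(self, input_size, hidden_size, num_layers=2):
        super().__init__()
        self.fwd = nn.ModuleList()
        self.bwd = nn.ModuleList()
        for i in range(num_layers):
            in_size = input_size if i == 0 else 2 * hidden_size
            self.fwd.append(DeepTransitionLSTM(in_size, hidden_size))
            self.bwd.append(
                DeepTransitionLSTM(in_size, hidden_size, reverse=True)
            )

    def forward(self, x, lengths):
        for f, b in zip(self.fwd, self.bwd):
            x = torch.cat([f(x, lengths), b(x, lengths)], dim=-1)
        return x


model = BiStackedDTLSTM(input_size=5, hidden_size=7, num_layers=2)
batch = torch.randn(3, 4, 5)        # 3 sequences padded to length 4
lengths = torch.tensor([4, 2, 1])   # true lengths
out = model(batch, lengths)         # shape: (3, 4, 14)
```

Because the mask leaves the state untouched on padded steps, the backward pass can simply iterate from the last index down to 0 without reversing each sequence by its own length. This does not remove the sequential Python loops from issue 1; it only lets each step process the whole batch at once.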


I am looking for an implementation of the DT-RNN but am facing the same issues as you are. Just wanted to ask whether you have found a solution.