Bi-directional and multi-layer LSTM in Seq2Seq auto-encoders

Hello everyone,

I don't have a PyTorch issue to report, but I would like to ask for good practices / recommendations on using bi-directional and multi-layer LSTMs in a Seq2Seq auto-encoder, please.

Before I give details: when I train my model with the default LSTM(num_layers=1, bidirectional=False) for both encoder and decoder, I get decent reconstruction results on the task. When I try using either several layers or a bidirectional layer, the results turn out pretty bad, whereas I hoped it would help training. I have several hundred thousand training samples and the model capacity is quite low, so I don't think the problem is over-parameterization; rather, I suspect I am not using these features correctly. So I am here to ask for your advice.

In the default case: (we assume everywhere batch_first=True)

E_output , _ = encoder_RNN ( input of shape = [batch,seq,in_dim] )
E_global_context = E_output[:,-1,:] of shape = [batch,hid_dim]

D_global_context = E_global_context.unsqueeze(1).repeat(1,seq,1)
D_output , _ = decoder_RNN ( D_global_context ) of shape = [batch,seq,out_dim]
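For concreteness, here is a minimal runnable sketch of that default setup (the dimension values are made up for illustration, not taken from the original model):

```python
import torch
import torch.nn as nn

# hypothetical sizes, just for the sketch
batch, seq, in_dim, hid, out_dim = 4, 7, 5, 8, 5

encoder_RNN = nn.LSTM(in_dim, hid, num_layers=1, bidirectional=False, batch_first=True)
decoder_RNN = nn.LSTM(hid, out_dim, batch_first=True)

x = torch.randn(batch, seq, in_dim)
E_output, _ = encoder_RNN(x)                  # [batch, seq, hid]
E_global_context = E_output[:, -1, :]         # [batch, hid], last time step
# repeat the context along the sequence dimension to feed the decoder
D_input = E_global_context.unsqueeze(1).repeat(1, seq, 1)  # [batch, seq, hid]
D_output, _ = decoder_RNN(D_input)            # [batch, seq, out_dim]
```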

multi-layer encoder_RNN: the encoder output is handled the same way (independent of num_layers)

bi-directional encoder_RNN: concatenate the forward and reverse last states

E_output = E_output.view(batch,seq,2,hid_dim)
E_global_context =[E_output[:,-1,0,:], E_output[:,0,1,:]], -1) of shape = [batch,2*hid_dim]

multi-layer decoder_RNN: the decoder output is handled the same way (independent of num_layers)

bi-directional decoder_RNN: D_output of shape = [batch,seq,out_dim*2]

Before and after the RNNs there is a Linear layer to adapt to the actual input feature size, latent context size, and output feature size. I am just wondering whether I am treating the outputs of multi-layer / bi-directional RNNs correctly, since this gives much worse results than the default LSTM settings.

Thanks for reading and for your advice!

Overall it looks fine. In this part:

E_global_context =[E_output[:,-1,0,:], E_output[:,0,1,:]], -1)

I was wondering, should it instead be:

E_global_context =[E_output[:,-1,0,:], E_output[:,-1,1,:]], -1)

That’s a good point I have been wondering about too.

According to the LSTM doc:

output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t.

To my understanding, for the forward direction we want index -1, which is the last step T.
For the backward direction, we want step 0, which is the last step of the reverse pass.

Maybe someone around the forum can confirm whether[E_output[:,-1,0,:], E_output[:,0,1,:]], -1) does that or not?
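One way to check this without relying on interpretation of the docs is to compare against h_n, which PyTorch documents as containing the final hidden state of each direction. A small sketch (sizes are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq, in_dim, hid = 4, 7, 5, 8

lstm = nn.LSTM(in_dim, hid, num_layers=1, bidirectional=True, batch_first=True)
x = torch.randn(batch, seq, in_dim)
out, (h_n, _) = lstm(x)                  # out: [batch, seq, 2*hid]

out = out.view(batch, seq, 2, hid)       # split the two directions
fwd_last = out[:, -1, 0, :]              # forward direction, last step
bwd_last = out[:, 0, 1, :]               # backward direction, step 0

# h_n: [num_layers * num_directions, batch, hid] -> (layers, directions, batch, hid)
h_n = h_n.view(1, 2, batch, hid)
print(torch.allclose(fwd_last, h_n[0, 0]))  # forward final state
print(torch.allclose(bwd_last, h_n[0, 1]))  # backward final state
```

If both comparisons hold, then[E_output[:,-1,0,:], E_output[:,0,1,:]], -1) does indeed pick the final state of each direction, whereas E_output[:,-1,1,:] would be the backward direction's state after seeing only one time step.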