Bidirectional RNN predicts padding for each timestep

Hi

I have the following Bidirectional LSTM language model:

drop = nn.Dropout(dropout)
word_embeddings = nn.Embedding(vocabSize, input_size)
rnn = nn.LSTM(input_size,
              hidden_size,
              n_layers,
              dropout=dropout,
              bidirectional=True)

fc = nn.Linear(hidden_size*numDir, vocabSize)

Assume the following:

vocabSize=107
batchsize=4
input_size=100
hidden_size=200
n_layers=1
dropout=0.1
bidir=True
numDir = 2 
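
For reference, the hidden passed to the LSTM in forward() below is the usual (h0, c0) pair. For a bidirectional LSTM its required shape is (n_layers*numDir, batchsize, hidden_size), so a zero initialization looks roughly like this (sketch, not my exact code):

hidden = (torch.zeros(n_layers*numDir, batchsize, hidden_size),   # h0: (2, 4, 200)
          torch.zeros(n_layers*numDir, batchsize, hidden_size))   # c0: (2, 4, 200)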

My vocab is a list of 107 tokens, with the 0th position being <pad>.

I am using pack_padded_sequence and pad_packed_sequence to pack and unpack my input in forward():

emb = drop(word_embeddings(input))                        # (T, B, input_size)
packed_input = pack_padded_sequence(emb, seqLengths)
packed_output, (hT, cT) = rnn(packed_input, hidden)
ht, _ = pad_packed_sequence(packed_output)                # (T, B, hidden_size*numDir)
logits = fc(ht.view(ht.size(0)*ht.size(1), ht.size(2)))   # (T*B, vocabSize)

My input dimensions are (T, B, *), where T is the sequence length and B is the batch size. The inputs are left-aligned, zero-padded, and sorted in descending order of length:

seqLengths = [19, 16, 16, 16]

for a particular batch. The dimensions I observe after forward() are:

input  = (19, 4)
emb    = (19, 4, 100)
ht     = (19, 4, 400)
logits = (76, 107)
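
For completeness, this self-contained sketch (random token IDs, untrained weights, same hyperparameters and seqLengths as above; not my actual training code) reproduces exactly these shapes:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

vocabSize, input_size, hidden_size = 107, 100, 200
n_layers, numDir, dropout = 1, 2, 0.1
batchsize = 4
seqLengths = [19, 16, 16, 16]

drop = nn.Dropout(dropout)
word_embeddings = nn.Embedding(vocabSize, input_size)
rnn = nn.LSTM(input_size, hidden_size, n_layers, dropout=dropout, bidirectional=True)
fc = nn.Linear(hidden_size*numDir, vocabSize)

inp = torch.randint(1, vocabSize, (max(seqLengths), batchsize))    # (19, 4); real data is zero-padded
hidden = (torch.zeros(n_layers*numDir, batchsize, hidden_size),
          torch.zeros(n_layers*numDir, batchsize, hidden_size))

emb = drop(word_embeddings(inp))                            # (19, 4, 100)
packed_input = pack_padded_sequence(emb, seqLengths)
packed_output, (hT, cT) = rnn(packed_input, hidden)
ht, _ = pad_packed_sequence(packed_output)                  # (19, 4, 400)
logits = fc(ht.view(ht.size(0)*ht.size(1), ht.size(2)))     # (76, 107)
print(inp.shape, emb.shape, ht.shape, logits.shape)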

which all make perfect sense. The model also works fine as a plain left-to-right (single-direction) LSTM. However, in bidirectional mode it predicts <pad> for every position of the sequence.

One hypothesis I was working with is that the padding, being the last element of each sequence (and the 0th position in the vocab), is killing the gradients in the backward pass. So I attempted to feed packed_output into the FC layer directly (i.e. without running pad_packed_sequence()), but that creates dimension mismatches. I also tried the dimension-selection methods from the threads "About the variable length input in RNN scenario" and "Indexing Multi-dimensional Tensors based on 1D tensor of indices", which use the sequence lengths but select only the last element of each sequence. However, I need all the elements of the tensor, not just the last non-zero one.
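
(For context, the kind of thing I mean by keeping <pad> out of the gradients is sketched below. padding_idx and ignore_index are the standard PyTorch options for this; they are not in my current code, and targets is just an illustrative name for the flattened label tensor.)

# Sketch only: freeze the <pad> embedding and drop padded positions
# (target == 0) from the loss.
word_embeddings = nn.Embedding(vocabSize, input_size, padding_idx=0)
criterion = nn.CrossEntropyLoss(ignore_index=0)
# logits is (T*B, vocabSize), so the (hypothetical) targets are flattened to (T*B,):
# loss = criterion(logits, targets.view(-1))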

Am I thinking about this problem incorrectly? Why does the bidirectional LSTM predict the padding token at every timestep in the setup above?

Thank you.

I came across this post because I had a problem taking the output of a padded input with a bidirectional LSTM, and your case is the same.
I might not have understood your problem completely, but here is what I think: you don't have to pass ht as it is, because it is the collection of hidden states over all timesteps. What you could do is select the first and last timestep vectors, remove the padding from both of them, concatenate them, and then pass the result to the FC layer; there is a rough sketch of what I mean below.
Please correct me if I'm wrong, and also reply if you found another way around it.
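
Roughly something like this (untested sketch, using the shapes from your post: ht is (T, B, 2*hidden_size) from pad_packed_sequence, hidden_size=200, and seqLengths is the list of true lengths):

import torch

def last_timestep_bidir(ht, lengths, hidden_size):
    # ht: (T, B, 2*hidden_size), padded output of the bidirectional LSTM
    # forward half: hidden state at the last *real* timestep of each sequence
    idx = torch.as_tensor(lengths, device=ht.device) - 1             # (B,)
    idx = idx.view(1, -1, 1).expand(1, ht.size(1), hidden_size)      # (1, B, hidden_size)
    fwd_last = ht[..., :hidden_size].gather(0, idx).squeeze(0)       # (B, hidden_size)
    # backward half: it reads right-to-left, so its "last" state sits at timestep 0
    bwd_last = ht[0, :, hidden_size:]                                 # (B, hidden_size)
    return torch.cat([fwd_last, bwd_last], dim=1)                     # (B, 2*hidden_size)

# e.g. logits = fc(last_timestep_bidir(ht, seqLengths, 200))  ->  (4, 107)

I believe the hT returned by the LSTM already holds these same two vectors (the last valid forward state and the t=0 backward state), so torch.cat([hT[0], hT[1]], dim=1) should give an equivalent result, but indexing ht like this makes it explicit.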