Hi
I have the following bidirectional LSTM language model:
drop = nn.Dropout(dropout)
word_embeddings = nn.Embedding(vocabSize, input_size)
rnn = nn.LSTM(input_size,
              hidden_size,
              n_layers,
              dropout=dropout,
              bidirectional=True)
fc = nn.Linear(hidden_size * numDir, vocabSize)
Assume the following:
vocabSize=107
batchsize=4
input_size=100
hidden_size=200
n_layers=1
dropout=0.1
bidir=True
numDir = 2
My vocab is a list of 107 tokens, with the 0th position being <pad>.
I am using pack_padded_sequence and pad_packed_sequence to pack and unpack my input in forward():
emb = drop(word_embeddings(input))
packed_input = pack_padded_sequence(emb, seqLengths)
packed_output, (hT, cT) = rnn(packed_input, hidden)
ht, _ = pad_packed_sequence(packed_output)
logits = fc(ht.view(ht.size(0) * ht.size(1), ht.size(2)))
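For reference, hidden is just the initial (h0, c0) state; with a bidirectional LSTM each tensor has shape (n_layers * numDir, batchsize, hidden_size), so e.g. a plain zero initialization looks like:

h0 = torch.zeros(n_layers * numDir, batchsize, hidden_size)  # (2, 4, 200)
c0 = torch.zeros(n_layers * numDir, batchsize, hidden_size)
hidden = (h0, c0)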
My input dimensions are (T, B, *), where T is the sequence length and B is the batch size. The inputs are left-aligned, zero-padded, and sorted by their lengths:
seqLengths = [19, 16, 16, 16]
for a particular batch. The dimensions I observe after forward() are:
input = (19,4)
emb = (19,4,100)
ht = (19,4,400)
logits = (76,107)
which all make perfect sense. The model also seems to work fine in the single, left-to-right direction. However, in bidirectional mode it predicts <pad> for every position of the sequence.
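For completeness, here is a minimal dummy-batch sketch (reusing the layers defined at the top; the token ids are arbitrary, only the shapes matter) that reproduces the dimensions listed above:

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

input = torch.randint(1, vocabSize, (19, 4))    # (T, B) dummy token ids
input[16:, 1:] = 0                              # zero-pad the three length-16 sequences
seqLengths = [19, 16, 16, 16]

emb = drop(word_embeddings(input))              # (19, 4, 100)
packed_input = pack_padded_sequence(emb, seqLengths)
packed_output, (hT, cT) = rnn(packed_input)     # default zero initial state
ht, _ = pad_packed_sequence(packed_output)      # (19, 4, 400) = (T, B, hidden_size * numDir)
logits = fc(ht.view(ht.size(0) * ht.size(1), ht.size(2)))  # (76, 107) = (T * B, vocabSize)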
One hypothesis I was working with is that the padding, being the last element of each sequence (and the 0th position in the vocab), is killing the gradients in the backward pass. So I attempted to feed packed_output into the FC layer directly (i.e. without running pad_packed_sequence()), but that creates dimension mismatches (roughly the attempt sketched below). I also tried the dimension-selection methods from the threads "About the variable length input in RNN scenario" and "Indexing Multi-dimensional Tensors based on 1D tensor of indices", which pass the sequence lengths but select only the last element of each sequence. However, I need all the elements of the tensor, not just the last non-zero one.
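Roughly, the packed_output attempt above amounts to applying the FC layer to the PackedSequence's .data field, which holds only the non-padded timesteps flattened to (sum(seqLengths), hidden_size * numDir) = (67, 400), so the logits come out as (67, 107) instead of (76, 107) and no longer line up with the flattened (T * B) layout the rest of my code expects:

flat_logits = fc(packed_output.data)   # (67, 107) rather than (76, 107)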
Am I thinking about this problem incorrectly? Why is the bidirectional LSTM predicting the padding token at every timestep in the setup above?

Thank you