Help with LSTM output

Hi all, I think I have a misunderstanding of how to use LSTMs. I’ve read through all the docs and also a bunch of LSTM examples.

I am trying to do something basic: take the output of an LSTM and pass it through a linear layer, but the sizes don’t come out the way I expect.
My batch size is 128, and that is what I expect at the final line of my forward, but instead I get back 22036.

class MyRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        super(MyRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, 4800)

    def init_weights(self):
        """Initialize weights."""
        self.embed.weight.data.uniform_(-0.1, 0.1)
        self.linear.weight.data.uniform_(-0.1, 0.1)

    def forward(self, features, captions, lengths):
        embeddings = self.embed(captions)
        print("embedding size: " + str(embeddings.size()))
        # prepend the image features as the first timestep
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        rnn_features, _ = self.lstm(packed)
        outputs = self.linear(rnn_features[0])
        # output should be of size 128 x 4800, not 22036 x 4800
        return outputs

Here are my print statement outputs:
captions size: torch.Size([128, 362])
captions size: torch.Size([128, 302])
padded captions size: torch.Size([22036])
embedding size: torch.Size([128, 302, 256])
packed size: torch.Size([22036, 256])
rnn_features: torch.Size([22036, 512])

It looks like the issue is that I don’t understand pack_padded_sequence. The docs say the output is “The returned Variable’s data will be of size TxBx*, where T is the length of the longest sequence and B is the batch size. If batch_first is True, the data will be transposed into BxTx* format.” But the output seems to just be 22036? Why is that?
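From playing around with a toy case (sizes made up by me, not from my model), it seems like the packed data is flattened to (sum of the sequence lengths, features) rather than (T, B, *):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two padded sequences with true lengths 3 and 2, batch_first layout.
padded = torch.tensor([[1., 2., 3.],
                       [4., 5., 0.]]).unsqueeze(-1)  # (batch=2, seq=3, feat=1)
lengths = [3, 2]  # must be sorted in decreasing order

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.shape)  # torch.Size([5, 1]) -> 5 = 3 + 2 = sum(lengths)
```

If that’s right, it would explain my 22036: it’s the total number of real (non-padding) tokens across the batch, not batch x seq.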

What is the point of pack_padded_sequence? It seems optional for RNNs: some code uses it and some doesn’t. Is it better to use packed sequences rather than plain padded Tensors?
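My current guess (again a sketch with made-up sizes, so correct me if I’m wrong) is that packing makes the LSTM stop at each sequence’s true last step, so the final hidden state h_n isn’t computed from padding. That would also give me the (batch, hidden) tensor I actually want to feed into the linear layer:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

batch, max_len, embed_size, hidden_size = 3, 5, 8, 12
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

seqs = torch.randn(batch, max_len, embed_size)
lengths = [5, 4, 2]  # true lengths, sorted in decreasing order

packed = pack_padded_sequence(seqs, lengths, batch_first=True)
_, (h_n, _) = lstm(packed)

# h_n is (num_layers, batch, hidden_size): one final state per sequence,
# taken at each sequence's true last step rather than at the padding.
final = h_n[-1]      # (batch, hidden_size) -> suitable input for nn.Linear
print(final.shape)   # torch.Size([3, 12])
```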

It also seems that if you use torch.nn.utils.rnn.pack_padded_sequence() then you don’t need to pass h_0 and c_0? It’s hard to tell; the docs don’t really say.
And for the input of the LSTM, the docs say “input (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence.” Can someone explain the difference between seq_len and input_size?

Any help would be greatly appreciated. I’ve been stuck on this for a while and have been trying to fix it on my own, since it seems like it should be easy, but nothing I’ve tried works.