Padding for RNN with pack_padded_sequence

I would like to do binary sentiment classification of texts using an LSTM.
My problem is that the model trains with a batch size of 1 but not when several sentences are processed in a batch.
I do not get runtime errors, but the model simply does not learn anything for larger batch sizes, so I suspect something is wrong with the padding or with how I use pack_padded_sequence/pad_packed_sequence in the LSTM.

This is my model:

import torch
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.type = type

        self.recurrent_layer = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=0.5, batch_first=True)

        self.fc = nn.Linear(hidden_size, num_classes)

    def init_hidden(self, batch_size):
        h_0 = Variable(torch.zeros(self.num_layers, batch_size, self.hidden_size))
        c_0 = Variable(torch.zeros(self.num_layers, batch_size, self.hidden_size))

        if torch.cuda.is_available():
            h_0 = h_0.cuda()
            c_0 = c_0.cuda()

        return (h_0, c_0)

    def forward(self, inputs, lengths):
        embedded = self.embedding(inputs)

        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.data.tolist(), batch_first=True)  # pack batch

        initial_hidden_state = self.init_hidden(inputs.size()[0])
        r_out, last_hidden_state = self.recurrent_layer(embedded, initial_hidden_state)  # pass in LSTM model
        r_out, recovered_lengths = nn.utils.rnn.pad_packed_sequence(r_out, batch_first=True)  # unpack batch

        idx = (lengths - 1).view(-1, 1).expand(r_out.size(0), r_out.size(2)).unsqueeze(1)
        # get last hidden output of each sequence
        r_out = r_out.gather(1, idx).squeeze(dim=1)

        out = self.fc(r_out)
        return out
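
One way to check that the gather step in forward() really picks the last valid output of each sequence is to compare it with the final hidden state the LSTM returns. The following is a minimal sketch on a toy batch, not the poster's training setup; the tensor sizes and the single unidirectional LSTM without dropout are assumptions made for the check.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch (assumed sizes): 3 sequences padded to length 4, 5 features each,
# already sorted by decreasing length.
batch = torch.randn(3, 4, 5)
lengths = torch.tensor([4, 2, 1])

lstm = nn.LSTM(input_size=5, hidden_size=6, num_layers=2, batch_first=True)

packed = pack_padded_sequence(batch, lengths.tolist(), batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
r_out, _ = pad_packed_sequence(packed_out, batch_first=True)

# Same indexing as in forward(): pick the output at position length - 1.
idx = (lengths - 1).view(-1, 1).expand(r_out.size(0), r_out.size(2)).unsqueeze(1)
last_out = r_out.gather(1, idx).squeeze(dim=1)

# h_n[-1] holds the final hidden state of the top layer for every sequence.
print(torch.allclose(last_out, h_n[-1], atol=1e-6))
```

For a unidirectional LSTM the two tensors should agree, so the check is expected to print True. If it does, the packing and gathering are consistent and the problem lies elsewhere.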

And this is how I train it:

def train(model, X_train, y_train, learning_rate, num_epochs, batch_size):
    # Loss and Optimizer
    criterion = nn.CrossEntropyLoss() # contains softmax layer and cross entropy loss, averages over examples in batch

    if torch.cuda.is_available():
        model.cuda()  # move the model parameters to the GPU

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Train the Model
    for epoch in range(num_epochs):
        train_loss = 0.0
        for i, (inputs, lengths, labels) in enumerate(get_minibatches(X_train, y_train, batch_size, shuffle=True)):
            inputs = Variable(torch.LongTensor(inputs))
            labels = Variable(torch.LongTensor(labels))
            lengths = Variable(torch.LongTensor(lengths))

            if torch.cuda.is_available():
                inputs = inputs.cuda()
                labels = labels.cuda()
                lengths = lengths.cuda()

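            # Forward + Backward + Optimize
            optimizer.zero_grad()
            outputs = model(inputs, lengths)

            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()  # use loss.data[0] on PyTorch < 0.4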

        print ('Epoch [%d/%d], Train loss: %.2f' %(epoch + 1, num_epochs, train_loss/(len(X_train)/batch_size)))

Here is how I create the minibatches and pad them:

def pad(inputs):
    lengths = [len(x) for x in inputs]

    max_len = max(lengths)

    for input in inputs:
        for i in range(0, max_len - len(input)):
            input.append(0)  # assumed padding index 0; appends in place, so the caller's lists are modified
    return inputs, lengths

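import operator
import random

def get_minibatches(inputs, targets, batch_size, shuffle=False):
    assert len(inputs) == len(targets)
    examples = list(zip(inputs, targets))  # a list, so it can be shuffled and sliced

    if shuffle:
        random.shuffle(examples)
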
    # take steps of size batch_size, take at least one step
    for start_idx in range(0, max(batch_size, len(inputs) - batch_size + 1), batch_size):
        batch_examples = examples[start_idx:start_idx + batch_size]

        batch_inputs, batch_targets = zip(*batch_examples)

        # pad the inputs
        batch_inputs, batch_lengths = pad(batch_inputs)
        # sort according to length
        batch_inputs, batch_lengths, batch_targets = zip(*sorted(zip(batch_inputs, batch_lengths, batch_targets), key=operator.itemgetter(1), reverse=True))

        yield list(batch_inputs), list(batch_lengths), list(batch_targets)
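
The padding step can also be written so that it returns new lists instead of modifying its input. This is a minimal sketch, not the poster's code: the helper name pad_batch and the padding index 0 are assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_index=0):
    """Return padded copies of the sequences plus their original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    # Build new lists; the input sequences are left as they are.
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch = [[4, 7], [3, 9, 2, 5], [8]]
padded, lengths = pad_batch(batch)

print(padded)   # every row now has the length of the longest sequence
print(lengths)  # lengths before padding
print(batch)    # the original batch is unchanged
```

Building new lists leaves the training data untouched, so lengths computed in later epochs still reflect the unpadded sequences.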

I have already checked that the inputs are padded correctly and that the inputs, lengths and targets match within each batch. I have also looked at the results of pack_padded_sequence, pad_packed_sequence and the r_out.gather operation and verified that they look correct and that the correct last LSTM state is selected.
However, the network does not learn anything for batch sizes larger than 1; the loss stays the same throughout the epochs.
Can anyone spot what I overlooked?


Did you find a solution to this? I have the same use case and am facing the same problem.


Unfortunately I have not been able to find the problem yet.
I have also tested my network on the MNIST image classification dataset, where it trains properly even for larger batch sizes.
The difference is that all images in this dataset are of the same size, so no padding is needed.
Now I know at least that the bug is either in my sequence padding function (although I debugged it and everything seemed fine) or how the RNN handles the padded sequences.
Let me know if you encounter a solution to the problem or get an RNN with padded sequences to work!

Hello, has anyone been able to identify an error here? I am facing the same issue on a different dataset and have packed/padded the sequences the same way.

Hi, I just checked your code, and it seems that nothing is wrong with your pack and unpack. I ran into the same issue: the model just didn’t learn anything and the loss didn’t go down.

In my case, I finally found that my lr was set too large (lr=1e-3). When I set lr=1e-5 and applied the update with each step, my model started to learn something after step > 60.

So, I don’t know the value of lr in your model. Have you tried lowering it?

Thanks for the reply!
Indeed, I finally found out that the padding works correctly and that the hyperparameter settings were the reason the network did not learn anything for batch sizes larger than 1.
I originally used the same hyperparameters as in a Theano implementation and therefore assumed that they would also work for my PyTorch implementation.
In my case, I had to add gradient clipping with a threshold of 2.0 (previously I had no gradient clipping) for the network to learn with lr=1e-3 and batch sizes between 1 and 64 (I did not test larger ones).
lr=1e-4 worked for batch sizes of 1 and 10, both with and without gradient clipping, but it did not work with larger batch sizes (again with and without gradient clipping).
But I think these results are very task-dependent.
The important thing I learned from this is to also consider the hyperparameters if my model fails to learn for higher batch sizes.
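
For readers who want to try the gradient clipping mentioned above, here is a minimal sketch of where the call goes in a training step. The post only states a threshold of 2.0; clipping by global gradient norm with torch.nn.utils.clip_grad_norm_ is an assumption, as are the stand-in model, the toy batch and the dummy loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.LSTM(input_size=5, hidden_size=6, batch_first=True)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 4, 5)                 # toy batch: 8 sequences of length 4
output, _ = model(inputs)
loss = (output * 100.0).pow(2).mean()         # dummy loss, scaled to give large gradients

optimizer.zero_grad()
loss.backward()

# clip_grad_norm_ rescales all gradients in place so that their joint L2 norm
# is at most max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(float(norm_before), float(norm_after))

optimizer.step()
```

The clipping call belongs between loss.backward() and optimizer.step(), so that the optimizer only ever sees the rescaled gradients.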

I tried this model too, and the ‘bug’ I found is that when you create minibatches you sort the inputs by length, which changes their order.
My batch function is a bit different because it takes a list of indexes as a parameter, not the inputs.
I fixed it like this (vec_q are my inputs):

vec_q, lengths, target, rindx = zip(*sorted(zip(vec_q, lengths, target, indexes), key=operator.itemgetter(1), reverse=True))

I then also return rindx so I can use it later to put the predictions of the model back where they belong.
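
The idea of carrying the original indexes through the sort and using them to restore the order can be sketched in plain Python. The data and the stand-in predictions below are made up for illustration; only the zip/sorted pattern comes from the post.

```python
import operator

inputs = [[4, 7, 0, 0], [3, 9, 2, 5], [8, 0, 0, 0]]   # already padded toy batch
lengths = [2, 4, 1]
targets = [1, 0, 1]
indexes = list(range(len(inputs)))                     # original positions

# Sort all four lists together by length, longest first.
inputs_s, lengths_s, targets_s, rindx = zip(*sorted(
    zip(inputs, lengths, targets, indexes),
    key=operator.itemgetter(1), reverse=True))

# Stand-in for model predictions, produced in sorted order (hypothetical values).
preds_sorted = ["pred_for_len_%d" % n for n in lengths_s]

# Put each prediction back at the position its input originally had.
preds = [None] * len(preds_sorted)
for sorted_pos, original_pos in enumerate(rindx):
    preds[original_pos] = preds_sorted[sorted_pos]

print(rindx)
print(preds)
```

Because the targets go through the same sort as the inputs, targets_s can be used directly for the loss, while rindx is only needed when predictions have to be reported in the original order.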

Thanks for highlighting the problem! I encountered the same problem. In fact, I created a thread in this forum to ask about it, and I didn’t know that pack_padded_sequence was the culprit. Once I removed it, the model learned well!

The problem does not lie with pack_padded_sequence but with the issue madsxcva described. When I sorted my x tensor, I didn’t apply the same sort order to the y tensor, so the training was incorrect!
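
In newer PyTorch releases (1.1 and later, as far as I know), pack_padded_sequence accepts enforce_sorted=False, which sorts the batch internally and restores the original order afterwards, so x and y never get out of step. This is a minimal sketch with assumed toy sizes, not code from any of the posts; it compares the batched result against running each sequence on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

lstm = nn.LSTM(input_size=5, hidden_size=6, batch_first=True)
lstm.eval()

batch = torch.randn(3, 4, 5)          # padded toy batch, NOT sorted by length
lengths = torch.tensor([2, 4, 1])     # must live on the CPU

# enforce_sorted=False: PyTorch sorts internally and undoes the sort in its outputs.
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)

# Reference: run every sequence alone, truncated to its true length.
singles = torch.stack([
    lstm(batch[i:i + 1, :n])[1][0][-1, 0]
    for i, n in enumerate(lengths.tolist())
])

print(torch.allclose(h_n[-1], singles, atol=1e-5))
```

If the check passes, row i of h_n[-1] belongs to row i of the unsorted batch, so the labels can stay in their original order.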