About the variable length input in RNN scenario

I have some working code (runs and learns) that uses an nn.LSTM for text classification. I tried modifying my code to work with packed sequences, and while it runs, the loss no longer decreases (just stays the same). Only two modifications were made:

FIRST: I sort the data (B, T, D) and sequence lengths (both LongTensors) before passing them to Variable with the following function:

def sort_batch(data, seq_len):
    """Sort a (B, T, D) batch in decreasing order of sequence length."""
    batch_size = data.size(0)
    sorted_seq_len, sorted_idx = seq_len.sort()  # ascending sort
    # Flip both so the longest sequence comes first, as pack_padded_sequence requires.
    reverse_idx = torch.linspace(batch_size - 1, 0, batch_size).long()
    sorted_seq_len = sorted_seq_len[reverse_idx]
    sorted_data = data[sorted_idx][reverse_idx]
    return sorted_data, sorted_seq_len
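As a sanity check, the ascending-sort-then-reverse trick above can be illustrated with plain Python lists (toy lengths, no tensors involved):

```python
# Toy illustration of sort_batch's index logic: an ascending sort followed
# by an index reversal yields a descending order of sequence lengths.
seq_len = [3, 7, 5]
sorted_pairs = sorted(enumerate(seq_len), key=lambda p: p[1])  # ascending
sorted_idx = [i for i, _ in sorted_pairs]                      # [0, 2, 1]
reverse_idx = list(range(len(seq_len) - 1, -1, -1))            # [2, 1, 0]
descending = [seq_len[sorted_idx[j]] for j in reverse_idx]
print(descending)  # [7, 5, 3]
```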

SECOND: I modified the forward function in the model code from the word_language_model PyTorch example. For padded sequences I used:

def forward(self, input, hidden):
    emb = self.encoder(input)
    output, hidden = self.rnn(emb, hidden)  # output: (B, T, H) with batch_first=True
    # Take the output at the final time step
    decoded = self.decoder(output[:, -1, :].squeeze())
    return F.log_softmax(decoded), hidden

And for the variable length sequences I used:

def forward(self, input, seq_len, hidden):
    emb = self.encoder(input)
    emb = pack_padded_sequence(emb, list(seq_len.data), batch_first=True)
    output, hidden = self.rnn(emb, hidden)
    output, _ = pad_packed_sequence(output, batch_first=True)  # back to (B, T, H)
    # Index of the last valid output for each sequence: shape (B, 1, H).
    idx = (seq_len - 1).view(-1, 1).expand(output.size(0), output.size(2)).unsqueeze(1)
    decoded = self.decoder(output.gather(1, idx).squeeze())
    return F.log_softmax(decoded), hidden

I believe each addition is implemented correctly, so I thought maybe there’s something more fundamental I’m missing about Variables or forward, or perhaps I’m not using pack_padded_sequence correctly. Thanks in advance, and great job to everyone working hard on PyTorch. It’s really terrific.


That’s weird; we ran grad checks for it and your code seems fine. Could you send me a script that reproduces the problem? Did you try comparing the grads from both versions?

Actually, in the original forward you seem to be taking the last time step, as if all the sequences had the same length, is that expected?

Good call on the original forward. They aren’t all the same length; that was just my first lazy pass before I thought to use gather to get the actual final output. I just ran the non-packed version with gather and it still works. I’ll look at the grads now. Is there a recommended way of doing this in PyTorch? Using gradcheck.py?

Yeah, I think gradcheck should work.
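For reference, gradcheck compares analytic gradients against finite-difference estimates. A minimal sketch of the underlying idea in plain Python (a hypothetical toy function, not the actual gradcheck API):

```python
# The idea behind gradcheck: compare the analytic gradient of a function
# against a central finite-difference estimate.
def f(x):
    return x ** 2  # toy function with known gradient 2 * x

def analytic_grad(x):
    return 2 * x

def numeric_grad(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 1.5
print(analytic_grad(x), numeric_grad(f, x))  # both close to 3.0
```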

@apaszke any update on improving the performance of chunk? I’m surprised it can become a bottleneck, since views into the original tensor could just be returned, couldn’t they?


I guess if chunked tensors need to be cat’d later, then copies are needed anyway. Perhaps there is still a way to make chunk faster!

Chunk returns views, but it’s implemented in Python, and that’s probably what makes it slow.
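The view-vs-copy distinction behind this discussion can be illustrated with stdlib Python (a memoryview over a bytearray standing in for tensor storage): slicing a view shares memory with the original, while a copy gets an independent buffer.

```python
# Views vs copies, the crux of the chunk discussion, illustrated with a
# memoryview over a bytearray (a stand-in for tensor storage).
buf = bytearray(b"abcdef")
view = memoryview(buf)[:3]  # a view: no data is copied
buf[0] = ord("z")
print(bytes(view))          # b'zbc' -- the view sees the mutation

copy = bytes(buf[:3])       # a copy: independent buffer
buf[0] = ord("a")
print(bytes(view), copy)    # b'abc' b'zbc' -- the copy is unaffected
```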


Hi scoinea,

Have you solved your problem? I ran into the same situation (the model does not converge) after using your code, and I’m not sure how to make it work.