Understanding pack_padded_sequence and pad_packed_sequence

vijendra_rana · June 18, 2017, 2:17am

Hi,
I am having trouble understanding these two utilities and cannot figure out what they do.

For example, I was trying to replicate the example from the thread “Simple working example how to use packing for variable-length sequence inputs for rnn”.

I followed the PyTorch documentation and wrote the code with batch_first=True:

import torch
import torch.nn as nn
from torch.autograd import Variable

batch_size = 3
max_length = 3
hidden_size = 2
n_layers =1
num_input_features = 1
input_tensor = torch.zeros(batch_size,max_length,num_input_features)
input_tensor[0] = torch.FloatTensor([1,2,3])
input_tensor[1] = torch.FloatTensor([4,5,0])
input_tensor[2] = torch.FloatTensor([6,0,0])
batch_in = Variable(input_tensor)
seq_lengths = [3,2,1]
pack = torch.nn.utils.rnn.pack_padded_sequence(batch_in, seq_lengths, batch_first=True)
print (pack)

Here I get output as

PackedSequence(data=Variable containing:
 1
 4
 6
 2
 5
 3
[torch.FloatTensor of size 6x1]
, batch_sizes=[3, 2, 1])

I could retrieve the original sequence back if I do

torch.nn.utils.rnn.pad_packed_sequence(pack, batch_first=True)

which is obvious.

But can somebody help me understand how and why we get that output ‘pack’ with size (6,1)? More generally, why do we need these two utilities, and how are they useful?
Thanks in advance for the help.

Cheers,
Vijendra

sinhasam · July 27, 2017, 3:15pm

They are used for seq-to-seq models with variable-length inputs. A sentence, for example, can be of any length, and to feed it into any kind of RNN you need to be able to get the output at the right time step.

Furthermore, check your dimensions and compare them to the ones in the original link.

You should be getting a 3x3 PackedSequence, but you have a small bug in the code, which is why you are getting a 6x1 PackedSequence.

matthew_zeng · July 28, 2017, 4:03pm

Actually, the original example set the wrong shape for the input, which is why you get 6x1 instead of 3x3.

The original example set the shape to [batch_size, 1, max_length], which is wrong.

Without packing, you have to unroll your RNN to a fixed length, so you get an output of fixed length. The desired outputs have different lengths, so you have to mask them yourself.

If you feed your minibatch with packing, the output of each example has its own length, so you don’t have to mask anything yourself.

moonlightlane · August 31, 2017, 7:18pm

Hi there, thanks for the clarification! I also saw these two posts, compared them, and can confirm the original post has the wrong dimension.

I am not sure I understand how the PyTorch RNN operates, though. For example, I don’t necessarily need to use pack_padded_sequence, correct? I can simply zero-pad all sequences in a minibatch to the length of the longest one and then feed the batch to the RNN, which accepts input of dimension [seq_len, batch, input_size]? I think doing so (manually padding each sequence) is the same as using the pack_padded_sequence function, correct?

Thank you in advance for a further clarification!

matthew_zeng · September 1, 2017, 12:18am

Right, you don’t have to use pack_padded_sequence. Padding is fine, but it is not the same as using pack_padded_sequence. For packed input, the RNN will not perform any calculation on the pad elements.

For example, take a padded minibatch of size 2, where zero is the padding:

1 1 1
1 0 0

The output will be 3 (seq length) x 2 (batch size). With packed input, however, the packed output contains only the valid steps (3 x 1 for the first sequence and 1 x 1 for the second): the RNN does not compute any output for the pad elements. Moreover, the returned hidden state is the one after the last valid input, not the one after the last zero pad (which is what you get if you feed the padded batch to the RNN directly).

A plain RNN does not distinguish pad elements from valid ones and performs the same calculation on both, so you may need to clean up the output (e.g., mask it) to get the result you want. A dynamic RNN (fed with packed input) does not have this problem.
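The difference can be checked directly. The following is only a small sketch, assuming a recent PyTorch and a single-layer nn.RNN with one input feature and a hidden size of 4 (both chosen arbitrarily for illustration); it runs the same two-sequence batch once padded and once packed.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Padded minibatch from the post: two sequences of lengths 3 and 1, one feature.
padded = torch.tensor([[1., 1., 1.],
                       [1., 0., 0.]]).unsqueeze(-1)   # (batch=2, seq=3, feat=1)
lengths = [3, 1]

# Illustrative sizes only; any hidden_size would do.
rnn = nn.RNN(input_size=1, hidden_size=4, batch_first=True)

with torch.no_grad():
    # Plain padded input: the RNN also steps over the two pad positions.
    out_padded, h_padded = rnn(padded)

    # Packed input: each sequence stops at its true length.
    packed = pack_padded_sequence(padded, lengths, batch_first=True)
    out_packed, h_packed = rnn(packed)
    out_unpacked, out_lengths = pad_packed_sequence(out_packed, batch_first=True)

print(out_packed.data.shape)   # one row per valid time step (3 + 1 = 4 rows)
print(out_unpacked[1])         # steps 1 and 2 of the short sequence are filled with zeros
print(h_packed[0, 1])          # hidden state after the single valid step
print(h_padded[0, 1])          # hidden state after two extra steps on zero input
```

For the short sequence, h_packed matches the padded run's output at step 0, while h_padded has been updated twice more on the zero padding.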

sritvik · October 3, 2017, 5:26pm

I am running into a similar issue here.

The inputs to pack_padded_sequence are sent and sent_len,
where sent is the input (batch_size, seq_length, features/embedding_dim), a [torch.FloatTensor of size 100x16x200],
and sent_len is a list containing the actual (unpadded) lengths of all sequences in the batch.

print(sent.size())    # (100, 16, 200), i.e., (batch size, padded sequence length, embedding dimension / feature)
print(len(sent_len))  # 100, i.e., the length of the list containing the sequence length of each element in the batch

sent_packed = nn.utils.rnn.pack_padded_sequence(sent, sent_len, batch_first=True)
print(sent_packed)

The data in sent_packed is expected to have the same dimensions as sent, right?
I am getting the following:

PackedSequence(data=Variable containing:
 0.1084  0.3546  0.4458  ...  -1.1613  1.1618  0.4275
 0.0564 -1.0614  0.1452  ...  -0.7359 -0.2980 -1.9538
 0.8342  0.2849  0.5471  ...  -0.9297 -0.3760  0.4382
          ...             ⋱             ...          
 0.5107  1.6905  0.3308  ...  -0.8220  0.7505 -0.9616
 1.5038  0.3528 -1.4010  ...  -0.9663 -0.7744 -0.5839
-0.7513  1.6879 -0.1883  ...   1.1898  0.5734  0.1458
[torch.FloatTensor of size 818x200]
, batch_sizes=[100, 100, 100, 100, 100, 99, 77, 58, 35, 25, 11, 5, 3, 2, 2, 1])

Any idea what the bug is?

Suthee · October 21, 2017, 7:44am

PackedSequence(data=Variable containing:
1
4
6
2
5
3
[torch.FloatTensor of size 6x1]
, batch_sizes=[3, 2, 1])

I think the output is expected and correct.
The feature dimension is 1, so each row of the 6x1 tensor represents one token of a given sequence. There are 6 tokens in total across 3 sequences. batch_sizes = [3,2,1] also makes sense: the first RNN iteration should receive the first tokens of all 3 sequences (which are [1, 4, 6]). For the next iteration, a batch size of 2 means the second tokens of the two sequences that still have one, which are [2, 5], because the last sequence has a length of 1. The final iteration has batch size 1 and contains only [3].

Did I misunderstand something here? Thanks!
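To make that ordering concrete, here is a small sketch (assuming a recent PyTorch) that rebuilds the packed layout by hand and compares it with what pack_padded_sequence returns. The loop is only an illustration of the layout, not how the library implements it.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[1., 2., 3.],
                       [4., 5., 0.],
                       [6., 0., 0.]]).unsqueeze(-1)   # (batch=3, seq=3, feat=1)
lengths = [3, 2, 1]

packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Rebuild the same layout by hand: walk over time steps and, at each step,
# keep only the sequences that still have a token at that step (length > t).
rows, batch_sizes = [], []
for t in range(max(lengths)):
    alive = [b for b, n in enumerate(lengths) if n > t]
    batch_sizes.append(len(alive))
    rows.extend(padded[b, t] for b in alive)
manual_data = torch.stack(rows)                       # (6, 1)

print(packed.data.flatten().tolist())   # [1.0, 4.0, 6.0, 2.0, 5.0, 3.0]
print(packed.batch_sizes.tolist())      # [3, 2, 1]
print(torch.equal(manual_data, packed.data))

# Unpacking restores the padded batch together with the original lengths.
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)
```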

Suthee · October 21, 2017, 7:49am

I am new here, so I might be wrong, but I think this is the correct behavior. There are 818 tokens/elements and each has 200 dimensions.

np.sum([100, 100, 100, 100, 100, 99, 77, 58, 35, 25, 11, 5, 3, 2, 2, 1]) is 818.
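The same bookkeeping can be reproduced on a smaller batch. This is a sketch with made-up lengths and a feature size of 3 (both hypothetical): batch_sizes[t] is the number of sequences longer than t, and the number of rows in data is the sum of the lengths.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

lengths = [5, 5, 4, 2, 1]                            # hypothetical, sorted longest first
sent = torch.randn(len(lengths), max(lengths), 3)    # (batch, padded length, features)

packed = pack_padded_sequence(sent, lengths, batch_first=True)

# Number of sequences that still have a token at time step t.
expected_batch_sizes = [sum(n > t for n in lengths) for t in range(max(lengths))]

print(packed.data.shape)             # torch.Size([17, 3]), and 17 == sum(lengths)
print(packed.batch_sizes.tolist())   # [5, 4, 3, 3, 2]
print(expected_batch_sizes)          # [5, 4, 3, 3, 2]
```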

11177 · April 23, 2018, 2:44am

You are absolutely right!

sitara_J · July 15, 2018, 8:08am

I ran your code but did not get that output. Instead, I got the error
“The expanded size of the tensor (1) must match the existing size (3) at non-singleton dimension 1”
Can anybody tell me why this happens and how to fix it? Thank you!

hughperkins · July 15, 2018, 12:56pm

I get that error too. I rewrote the code slightly and got rid of the error, but I haven’t double-checked whether it’s doing what it’s supposed to be doing. There is no error message, though.

import torch
import torch.nn as nn


batch_size = 3
max_length = 3

batch_in = torch.LongTensor(batch_size, max_length).zero_()
batch_in[0] = torch.LongTensor([1, 2, 3])
batch_in[1] = torch.LongTensor([4, 5, 0])
batch_in[2] = torch.LongTensor([6, 0, 0])
seq_lengths = [3, 2, 1]
pack = torch.nn.utils.rnn.pack_padded_sequence(batch_in, seq_lengths, batch_first=True)
print(pack)

output:

PackedSequence(data=tensor([ 1,  4,  6,  2,  5,  3]), batch_sizes=tensor([ 3,  2,  1]))
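Note that the rewrite above drops the feature dimension. A possible alternative, sketched here under the assumption that the error comes from assigning a flat tensor of shape (3,) into a slot of shape (3, 1), keeps the original (batch, seq, feature) layout by adding a trailing axis to each row before assignment:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

batch_size, max_length, num_input_features = 3, 3, 1

input_tensor = torch.zeros(batch_size, max_length, num_input_features)
# input_tensor[0] has shape (3, 1), so give the right-hand side a trailing
# feature axis instead of assigning a flat tensor of shape (3,).
input_tensor[0] = torch.tensor([1., 2., 3.]).unsqueeze(-1)
input_tensor[1] = torch.tensor([4., 5., 0.]).unsqueeze(-1)
input_tensor[2] = torch.tensor([6., 0., 0.]).unsqueeze(-1)

seq_lengths = [3, 2, 1]
pack = pack_padded_sequence(input_tensor, seq_lengths, batch_first=True)

print(pack.data.shape)    # torch.Size([6, 1])
print(pack.batch_sizes)   # tensor([3, 2, 1])
```

The float 6x1 data tensor can then be fed to an RNN created with input_size=1.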

Revo_Let · September 11, 2018, 12:59pm

Does choosing PackedSequence over manual padding have an effect on loss/accuracy? Suppose I pad all my sequences with 0’s to a fixed length. During training I take the output of the last hidden unit of an LSTM/RNN for prediction (in a many-to-one fashion). For many of the samples, a bunch of meaningless 0’s have been incorporated into the prediction. Should this hurt accuracy? It seems so, but I can’t really prove it…

matthew_zeng · November 8, 2018, 8:58am

Padding has an effect on the hidden state. However, in many cases (e.g., when the input lengths are quite close to each other) the effect is quite small.

If a batch consists of one very long sequence and some very short sequences, then the short ones have to be padded with many zeros. Running the RNN on those heavily padded short sequences is then almost the same as running it on all-zero sequences, right?
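For the many-to-one case, one way to avoid reading a hidden state that has absorbed the padding, without packing, is to pick each sequence's output at its last valid step. This is a sketch only, assuming a unidirectional single-layer nn.LSTM with arbitrary sizes; it compares the gathered outputs with the final hidden state of a packed run.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

padded = torch.tensor([[1., 2., 3.],
                       [4., 5., 0.],
                       [6., 0., 0.]]).unsqueeze(-1)   # (batch=3, seq=3, feat=1)
lengths = torch.tensor([3, 2, 1])

# Illustrative sizes only.
lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)

with torch.no_grad():
    out, (h_n, _) = lstm(padded)                      # out: (3, 3, 4)

    # Index of the last real step of every sequence, shaped for gather().
    last_idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
    last_valid = out.gather(1, last_idx).squeeze(1)   # (3, 4)

    # Reference: the final hidden state of a packed run.
    packed = pack_padded_sequence(padded, lengths, batch_first=True)
    _, (h_packed, _) = lstm(packed)

print(last_valid)
print(h_packed[0])
```

The two printed tensors should agree up to floating-point tolerance, whereas h_n from the padded run differs for the sequences that contain padding.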

Suzyahyah · July 19, 2019, 2:39pm

Padding: Standardises variable-length sequences to a common length.

Packing: A format that lets the RNN ignore the “pads”. Note that we feed the original lengths (before padding) as input to the pack_padded_sequence function.

The whole sequence is

pad
embed
pack_padded
– [rnn] -->
pad_packed
eval

The second pad_packed is basically an “unpack”.

Here is a minimal working example with some explanation, hope it helps.
https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html
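For reference, the steps above can be sketched roughly as follows. This is not the code from the linked post; the token ids, vocabulary size, and layer sizes are hypothetical, and a recent PyTorch is assumed (enforce_sorted=False is used so the batch does not need to be sorted by length).

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Hypothetical token-id sequences of different lengths; id 0 is reserved for padding.
seqs = [torch.tensor([4, 2, 7]), torch.tensor([5]), torch.tensor([3, 9])]
lengths = torch.tensor([len(s) for s in seqs])

# 1. pad
padded = pad_sequence(seqs, batch_first=True, padding_value=0)        # (3, 3)

# 2. embed
embed = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
embedded = embed(padded)                                               # (3, 3, 8)

# 3. pack (enforce_sorted=False keeps the batch in its original order)
packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                              enforce_sorted=False)

# 4. rnn
lstm = nn.LSTM(input_size=8, hidden_size=6, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)

# 5. unpack
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)  # (3, 3, 6)

print(out.shape, out_lengths.tolist())
```

Evaluation (the last step in the list) would then use out together with out_lengths, or h_n directly for a many-to-one model.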

tpatel0409 · September 24, 2019, 6:54am

One basic doubt here:
Isn’t the input already zero-padded here?
I think the only thing pack_padded_sequence does is pack the data so that the RNN is unrolled dynamically and the extra zero-padded positions are not used in the forward pass.

ironv · October 2, 2019, 4:01am

Thanks for posting the example on your blog (…DataLoader-Pad-Pack-Sequence.html), I found it extremely useful. One of the things which is seemingly glossed over is the initialization of the hidden state

output_packed, hidden = rnn(x_packed, hidden)

Can you please elaborate on how to initialize the hidden state? In particular:

beginning of every minibatch?
beginning of every epoch?
values set to 0 or random values?
should the hidden state be explicitly saved and passed back in?

Thx.

Suzyahyah · October 3, 2019, 7:29pm

Cheers, I’m really glad it helped! (and apologies to the mods and other readers for the cross post)

If the data instances are unrelated to each other, e.g., randomly shuffled, it is more appropriate to initialize the hidden state afresh at the start of each sequence.

Beginning of every minibatch: This implies that we are using the hidden state from the previous sentence/sequence to initialise the next sequence. We might want to do this if the instances in the minibatch are related. Maybe a minibatch is a paragraph of sentences?
Beginning of every epoch: Same as above, except on a larger scale.
Values set to 0 or random: Both are valid approaches. There is less risk of overfitting if the hidden state is randomly initialized than if it is always 0. Having said that, there are “stronger” ways of preventing overfitting, such as dropout, which is tunable. So if your goal is to squeeze out the last bit of performance, I would try both. If you are implementing for research, just go with 0.

There are also more advanced methods like learning the initial hidden state, which is an ongoing area of research.
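As a rough illustration of two of these options, the sketch below uses a hypothetical Encoder wrapper around nn.GRU (the class name, sizes, and the learn_h0 flag are all made up for this example). It either creates a fresh zero state for every minibatch or expands a single learned initial state across the batch.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical GRU wrapper with a choice of initial hidden state."""

    def __init__(self, input_size=8, hidden_size=6, learn_h0=False):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        # Optional learned initial state, shared by every sequence in a batch.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size)) if learn_h0 else None

    def init_hidden(self, batch_size):
        if self.h0 is None:
            # Fresh zeros for every minibatch: nothing carries over between batches.
            return torch.zeros(1, batch_size, self.rnn.hidden_size)
        # Same learned vector for every sequence in the batch.
        return self.h0.expand(-1, batch_size, -1).contiguous()

    def forward(self, x):
        return self.rnn(x, self.init_hidden(x.size(0)))

x = torch.randn(5, 3, 8)                       # (batch, seq, features)
out, h = Encoder()(x)                          # zero initial state
out_l, h_l = Encoder(learn_h0=True)(x)         # learned initial state
print(out.shape, h.shape)
```

Passing an all-zero state explicitly behaves the same as passing no state at all, since PyTorch's recurrent layers default to zeros when the hidden state is omitted.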

Pratheesh_Kumar · September 22, 2020, 5:51am

In the case of a chatbot application where I feed in a batch of inputs, how do I initialize the (initial) hidden state?