Understanding pack_padded_sequence and pad_packed_sequence

Hi,
I have a problem understanding these two utilities; I'm not able to figure out what they do.

For example, I was trying to replicate the example from Simple working example how to use packing for variable-length sequence inputs for rnn.

I followed the PyTorch documentation and coded it with batch_first:

import torch
import torch.nn as nn
from torch.autograd import Variable

batch_size = 3
max_length = 3
hidden_size = 2
n_layers = 1
num_input_features = 1
input_tensor = torch.zeros(batch_size, max_length, num_input_features)
# NB: on newer PyTorch versions these assignments raise a size-mismatch
# error, because a (3,) tensor does not broadcast into a (3, 1) slice;
# see the discussion further down.
input_tensor[0] = torch.FloatTensor([1, 2, 3])
input_tensor[1] = torch.FloatTensor([4, 5, 0])
input_tensor[2] = torch.FloatTensor([6, 0, 0])
batch_in = Variable(input_tensor)
seq_lengths = [3, 2, 1]
pack = torch.nn.utils.rnn.pack_padded_sequence(batch_in, seq_lengths, batch_first=True)
print(pack)

Here I get the following output:

PackedSequence(data=Variable containing:
 1
 4
 6
 2
 5
 3
[torch.FloatTensor of size 6x1]
, batch_sizes=[3, 2, 1])

I could retrieve the original sequence back if I do

torch.nn.utils.rnn.pad_packed_sequence(pack, batch_first=True)

which is obvious.

But can somebody help me understand how and why we got that output 'pack' with size (6, 1)? Also, the functionality as a whole: why do we need these two utilities, and how are they useful?
Thanks in advance for the help.

Cheers,
Vijendra


They are used for sequence-to-sequence models with variable lengths. For example, a sentence can have variable length, and to feed it into any class of RNN, you need to be able to get the output at the right time step.

Furthermore, check your dimensions and compare them to those in the original link:

You should be getting a 3x3 PackedSequence, but you just have a small bug in the code, which is why you are getting a 6x1 PackedSequence.

Actually, the original example set the wrong shape for the input; that's why you get 6x1 instead of 3x3.

The original example set the shape as [batch_size, 1, max_length], which is obviously wrong.

Without packing, you have to unroll your RNN to a fixed length, and then you get a fixed-length output. Since the desired outputs have different lengths, you have to mask them yourself.

If you feed your minibatch with packing, the output of each example will have its own length, so you don't have to do any masking yourself.


Hi there, thanks for the clarification! I also saw these two posts, compared them, and can confirm the original post has the wrong dimensions.

I am not sure I understand how the PyTorch RNN operates, though: for example, I don't necessarily need to use pack_padded_sequence, correct? I could simply zero-pad all sequences in a minibatch to the longest sequence manually, and then feed the batch into the RNN, which accepts input of dimension [seq_len, batch, input_size]? I think doing so (manually padding each sequence) is the same as using the pack_padded_sequence function, correct?

Thank you in advance for a further clarification!


Right, you don't have to use pack_padded_sequence. Padding is fine, but it is different from using pack_padded_sequence: for a packed input, the RNN will not perform calculations on the pad elements.

For example, say you have a padded minibatch (batch size 2), where zero is padding:

1 1 1
1 0 0

The output will be 3 (seq length) x 2 (batch size). However, a packed input will result in a packed output containing (3 x 1 and 1 x 1). If you feed the pack into the RNN, it will not calculate output for your pad elements. Moreover, hidden will be the hidden state after the last valid input, instead of the hidden state after the last zero padding (which is what you would get if you fed the pads into the RNN).

The RNN actually does not distinguish pad elements from valid elements; it performs the same calculation on both. You may need to clean the output (e.g., mask it) to get the result you want. A dynamic RNN (one fed with packed input) does not have this problem.
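
Here is a minimal sketch of that difference (the layer sizes and values are arbitrary, made up for illustration):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.RNN(input_size=1, hidden_size=2, batch_first=True)

# Padded minibatch of size 2: the second sequence has one valid step, two pads.
batch = torch.tensor([[[1.], [1.], [1.]],
                      [[1.], [0.], [0.]]])
lengths = [3, 1]

# Plain padded input: the RNN also runs over the pad steps, so the final
# hidden state of sequence 2 reflects the zero padding.
out_padded, h_padded = rnn(batch)

# Packed input: the pad steps are skipped, so the hidden state of
# sequence 2 is the state right after its last valid step.
packed = pack_padded_sequence(batch, lengths, batch_first=True)
out_packed, h_packed = rnn(packed)
out_unpacked, out_lengths = pad_packed_sequence(out_packed, batch_first=True)

print(h_padded[0, 1])  # hidden after step 3 (pads included)
print(h_packed[0, 1])  # hidden after step 1 (last valid step)

The two printed hidden states generally differ, which is exactly the effect described above.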


Getting a similar issue here,

The inputs provided to pack_padded_sequence are sent and sent_len, where sent is the input (batch_size, seq_length, features/embedding_dim) with dimension [torch.FloatTensor of size 100x16x200], and sent_len is a list containing the actual (unpadded) lengths of all sequences in the batch.

print(sent.size())    # (100L, 16L, 200L), i.e., (batch size, padded sequence length, embedding dimension / features)
print(len(sent_len))  # 100, i.e., one sequence length per element in the batch

sent_packed = nn.utils.rnn.pack_padded_sequence(sent, sent_len, batch_first=True)
print(sent_packed)

Now, the output sent_packed is expected to have the same dimensions as sent, right?
I am getting the following:

PackedSequence(data=Variable containing:
 0.1084  0.3546  0.4458  ...  -1.1613  1.1618  0.4275
 0.0564 -1.0614  0.1452  ...  -0.7359 -0.2980 -1.9538
 0.8342  0.2849  0.5471  ...  -0.9297 -0.3760  0.4382
          ...             ⋱             ...          
 0.5107  1.6905  0.3308  ...  -0.8220  0.7505 -0.9616
 1.5038  0.3528 -1.4010  ...  -0.9663 -0.7744 -0.5839
-0.7513  1.6879 -0.1883  ...   1.1898  0.5734  0.1458
[torch.FloatTensor of size 818x200]
, batch_sizes=[100, 100, 100, 100, 100, 99, 77, 58, 35, 25, 11, 5, 3, 2, 2, 1])

Any idea what the bug is?


PackedSequence(data=Variable containing:
1
4
6
2
5
3
[torch.FloatTensor of size 6x1]
, batch_sizes=[3, 2, 1])

I think the output is expected and correct.
The feature dimension is 1, so each row of the 6x1 tensor represents one token of the given sequences. There are 6 tokens in total across 3 sequences. batch_sizes = [3, 2, 1] also makes sense, because the first iteration of the RNN should consume the first tokens of all 3 sequences (which are [1, 4, 6]). For the next iteration, a batch size of 2 means the second tokens of the first two sequences, which are [2, 5], because the last sequence has a length of 1.

Did I misunderstand something here? Thanks!
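
To make that concrete, here is a small sketch that walks pack.data with batch_sizes to recover the tokens consumed at each time step (assuming the 6x1 pack from above):

import torch

data = torch.tensor([[1.], [4.], [6.], [2.], [5.], [3.]])  # pack.data
batch_sizes = [3, 2, 1]                                    # pack.batch_sizes

# Each consecutive chunk of `data` holds the tokens the RNN consumes
# at one time step: step 0 -> [1, 4, 6], step 1 -> [2, 5], step 2 -> [3].
offset = 0
for t, size in enumerate(batch_sizes):
    step = data[offset:offset + size]
    offset += size
    print(f"time step {t}: {step.squeeze(-1).tolist()}")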


I am new here, so I might be wrong, but I think this is correct behavior: there are 818 tokens/elements, and each has 200 dimensions.

np.sum([100, 100, 100, 100, 100, 99, 77, 58, 35, 25, 11, 5, 3, 2, 2, 1]) is 818.


You are absolutely right!

I ran your code but I didn't get the output; instead I got the error
"The expanded size of the tensor (1) must match the existing size (3) at non-singleton dimension 1"
Can anybody tell me why, and how to fix it? Thank you!

I get that error too. I rewrote it slightly and got rid of the error, but I haven't double-checked whether it's doing what it's supposed to do. No error message, though. :slight_smile:

import torch
import torch.nn as nn


batch_size = 3
max_length = 3

batch_in = torch.LongTensor(batch_size, max_length).zero_()
batch_in[0] = torch.LongTensor([1, 2, 3])
batch_in[1] = torch.LongTensor([4, 5, 0])
batch_in[2] = torch.LongTensor([6, 0, 0])
seq_lengths = [3, 2, 1]
pack = torch.nn.utils.rnn.pack_padded_sequence(batch_in, seq_lengths, batch_first=True)
print(pack)

output:

PackedSequence(data=tensor([ 1,  4,  6,  2,  5,  3]), batch_sizes=tensor([ 3,  2,  1]))
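
For completeness, a sketch of round-tripping back through pad_packed_sequence (continuing from the snippet above):

unpacked, lengths = torch.nn.utils.rnn.pad_packed_sequence(pack, batch_first=True)
print(unpacked)  # tensor([[1, 2, 3], [4, 5, 0], [6, 0, 0]])
print(lengths)   # tensor([3, 2, 1])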

Is there an effect of choosing PackedSequence over manual padding in terms of loss/accuracy? Suppose I padded all my sequences with 0s to a fixed length. During training I retrieve the output from the last hidden unit of an LSTM/RNN for prediction (in a many-to-one fashion). For many of the samples, a bunch of meaningless 0s have been incorporated into the prediction. Should this hurt accuracy? It sounds like it should, but I can't really prove it…

Padding has an effect on the hidden state. However, in many cases (e.g., when the input lengths are quite close) the effect is quite small.

If a batch consists of one very long sequence and some very short sequences, then the short ones have to be padded with many zeros. In that case, running the RNN over the zero-padded tail of a short sequence is essentially the same as running the RNN over a sequence of zeros, right?

Padding: standardises variable-length sequences.

Packing: a format that lets the RNN ignore the "pads". Note that we feed the original lengths (before padding) as input to the pack_padded_sequence function.

The whole sequence is:

  1. pad
  2. embed
  3. pack_padded
    -- [rnn] -->
  4. pad_packed
  5. eval

The second step, pad_packed, is basically an "unpack" (see the sketch below).
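
Here is a minimal sketch of that pipeline (the vocabulary size, embedding dimension, and choice of GRU are arbitrary assumptions for illustration):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# 1. pad: token-id sequences of different lengths, padded to the longest.
seqs = [torch.tensor([4, 7, 2]), torch.tensor([5, 1]), torch.tensor([9])]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # (3, 3)

# 2. embed: the pads get embedded too, but packing will skip them.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
embedded = embedding(padded)  # (3, 3, 8)

# 3. pack_padded: pass the original (pre-padding) lengths.
packed = pack_padded_sequence(embedded, lengths, batch_first=True)

# -- [rnn] -->
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
packed_out, hidden = rnn(packed)

# 4. pad_packed: "unpack" back to a padded tensor for evaluation/loss.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)  # torch.Size([3, 3, 16])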

Here is a minimal working example with some explanation, hope it helps.
https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html


One basic doubt here:
Isn't the input already zero-padded at this point?
I think the only thing pack_padded_sequence is going to do is pack the data so that the dynamic unrolling of the RNN happens accordingly and the extra zero-padded steps aren't used in the forward pass.

Thanks for posting the example on your blog (…DataLoader-Pad-Pack-Sequence.html); I found it extremely useful. One thing that is seemingly glossed over is the initialization of the hidden state:

output_packed, hidden = rnn(x_packed, hidden)

Can you please elaborate on how to initialize the hidden state? Should it be done:

  • beginning of every minibatch?
  • beginning of every epoch?
  • values set to 0 or random values?
  • should the hidden state be explicitly saved and passed back in?

Thx.

Cheers, I'm really glad it helped! (And apologies to the mods and other readers for the cross-post.)

If the data instances are not related (e.g., randomly shuffled), it is more appropriate to initialize the hidden state at the start of each sequence; a sketch follows after the list below.

  • Beginning of every minibatch: This implies that we are using the hidden state from the previous sentence/sequence to initialise the next sequence. We might want to do this if the instances in the minibatch are related. Maybe a minibatch is a paragraph of sentences?

  • Beginning of every epoch: Same as above, except on a larger scale.

  • Values set to 0 or random: both are valid approaches. There is less risk of overfitting if the hidden state is randomly initialized, compared to 0 all the time. Having said that, there are "stronger" ways of preventing overfitting, like dropout, which is tunable. So if your goal is to squeeze out the last bit of performance, try both; if you are implementing for research, just go with 0.

There are also more advanced methods like learning the initial hidden state, which is an ongoing area of research.
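
As a concrete illustration, here is a sketch of zero-initialising the hidden state at the start of every minibatch (the layer sizes are arbitrary assumptions):

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

def init_hidden(batch_size, num_layers=1, hidden_size=16):
    # A fresh zero state per minibatch is appropriate when the
    # sequences in different batches are unrelated (e.g., shuffled).
    return torch.zeros(num_layers, batch_size, hidden_size)

x = torch.randn(4, 5, 8)            # (batch, seq_len, features)
hidden = init_hidden(batch_size=4)  # re-created for every minibatch
output, hidden = rnn(x, hidden)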


In the case of a chatbot application where I feed in batches of inputs, how do I initialize the (initial) hidden state?