Weird behavior of packed padded sequence (implementing recurrent policy for PPO)

Hello everybody!

I’m implementing a recurrent model for training my RL agent with PPO and now I’m concerned with arranging my training data into sequences.
After sampling training data from the current policy, I split the data into episodes and then into sequences.
Next, I pad the model's input data (i.e. the agent's observations) so that all sequences have a fixed length.
The other training data (e.g. log probabilities, advantages, …) that are used for computing the losses remain unpadded.
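
Roughly, what I mean by splitting and padding is something like this (a simplified sketch; split_and_pad is just a placeholder name, not my exact code):

import torch

def split_and_pad(episode_obs, sequence_length):
    # episode_obs: tensor of shape (episode_steps, *obs_shape)
    # Split the episode into chunks of at most sequence_length steps
    chunks = list(torch.split(episode_obs, sequence_length, dim=0))
    lengths = [c.size(0) for c in chunks]
    # Zero-pad each chunk at the end so that all sequences have the same length
    padded = [torch.cat([c, c.new_zeros(sequence_length - c.size(0), *c.shape[1:])])
              for c in chunks]
    return torch.stack(padded), lengths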

Once the data is fed to the model, the padded input is propagated through a few conv layers. The resulting output is then used to create a packed sequence via pack_padded_sequence(sequences, lengths, enforce_sorted=False), which is fed to a single LSTM layer.

Now I'm uncertain whether I have to call the inverse function pad_packed_sequence(packed_lstm_output). By accessing packed_lstm_output.data I'd receive a tensor that no longer contains the zero pads, and the size of packed_lstm_output.data would conform to the size of the remaining training data. However, this way the agent is not able to solve any RL environment: the entropy stays constant and the loss is pretty small.

Basically, I’d like to know whether the order of packed_lstm_output.data is different from the original order before creating the packed sequence.
Or would the correct way be to convert all of my training data to packed padded sequences?
Or is there some way to remove the zero pads after calling the inverse function pad_packed_sequence(packed_lstm_output)?
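
To make the last question concrete, this is roughly what I mean by removing the zero pads after unpacking (a toy sketch with made-up shapes; the mask approach is just one option I can think of, assuming batch_first=True padding):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy example: two padded sequences with true lengths 3 and 2, feature size 1
padded_in = torch.zeros(2, 3, 1)
padded_in[0, :, 0] = torch.tensor([1., 2., 3.])
padded_in[1, :2, 0] = torch.tensor([4., 5.])
lengths = torch.tensor([3, 2])

packed = pack_padded_sequence(padded_in, lengths, batch_first=True, enforce_sorted=False)
lstm = torch.nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
packed_out, _ = lstm(packed)

# Unpack, then drop the padded steps with a mask built from the lengths
padded_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
mask = torch.arange(padded_out.size(1))[None, :] < out_lengths[:, None]
valid_out = padded_out[mask]   # shape: (sum(lengths), hidden_size) = (5, 4)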

I believe that the order of my model’s output is no longer the same as that of the other training data (e.g. log probabilities, advantages, …).

# Reshape the flattened batch to (sequence_length, batch_size, features)
h_shape = tuple(h.size())
h = h.view(sequence_length, (h_shape[0] // sequence_length), h_shape[1])
# Convert data to packed padded sequence
packed_sequences = pack_padded_sequence(h, actual_sequence_lengths, enforce_sorted=False)
# Initialize hidden states to zero
hxs = torch.zeros((h_shape[0] // sequence_length), self.hidden_state_size, dtype=torch.float32, device=device, requires_grad=False)
packed_h, hxs = self.gru(packed_sequences, hxs.unsqueeze(0))
h = packed_h.data
# feed further hidden layers until reaching the policy and the value head...

I wrote a debugging environment. Its observation space is a single number in the range (1, 12). Each step, the next number is returned as the observation. The episode is done once the end of this range is reached.

After sampling observations, I padded and packed the data. I left a copy of the observations untouched for comparison.

This is the data of the packed padded sequence (printed via packed.data.view(-1)):

[ 9.,  1.,  3., 11.,  5.,  7., 10.,  6.,  8.,  2.,  4.,  0.,  5.,  0.,
         0.,  7.,  1.,  3.,  6.,  2.,  4.,  0.,  0.,  1.,  9., 11.,  3.,  0.,
         0.,  2.,  0.,  0., 10.,  0.,  0.,  5.,  7.,  0.,  9., 11.,  9.,  1.,
         3., 11.,  5.,  7.,  5.,  0.,  0.,  7.,  1.,  3.,  1.,  9., 11.,  3.,
         0.,  0.,  0.,  5.,  7.,  0.,  9.,  0.]

And this is what I expected:

[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.,  1.,  2.,  3.,
         4.,  5.,  6.,  7.,  8.,  9., 10., 11.,  1.,  2.,  3.,  4.,  5.,  6.,
         7.,  8.,  9., 10., 11.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.,
        10., 11.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.,  1.,
         2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.]

Two major conflicting observations can be made (see the toy example after this list):

  • There are zeros in the data?!
  • The data is not in order
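
To get a feeling for the second point: on a small hand-made example (two sequences of lengths 3 and 2), pack_padded_sequence lays out .data interleaved time step by time step across the (length-sorted) batch, not sequence by sequence:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two padded sequences (time dimension first), true lengths 3 and 2
seqs = torch.tensor([[ 1., 10.],
                     [ 2., 20.],
                     [ 3.,  0.]]).unsqueeze(-1)   # shape (T=3, B=2, 1)
packed = pack_padded_sequence(seqs, torch.tensor([3, 2]), enforce_sorted=False)
print(packed.data.view(-1))   # tensor([ 1., 10.,  2., 20.,  3.])
print(packed.batch_sizes)     # tensor([2, 2, 1])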

Before packing the data, I call these lines:

h_shape = tuple(h.size())
h = h.view(sequence_length, (h_shape[0] // sequence_length), h_shape[1])
# Convert data to packed padded sequence
packed = pack_padded_sequence(h, actual_sequence_lengths, enforce_sorted=False)
print(packed.data.view(-1))

The reshaping is done because the data was flattened in order to feed the entire batch to the first layers of the model.

Why do you think there are zeros in packed.data?
Concerning the order, I could imagine that the call to view does not work like PyTorch's reshape function.
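
A toy sketch of that suspicion (assuming the flattened batch is laid out sequence by sequence, i.e. all steps of the first sequence come first):

import torch

sequence_length = 3
# Assume the flattened batch is laid out sequence by sequence:
# rows 0-2 belong to sequence A, rows 3-5 to sequence B
flat = torch.arange(6).float().view(6, 1)

a = flat.view(sequence_length, 2, 1)                    # what my snippet above does
b = flat.view(2, sequence_length, 1).permute(1, 0, 2)   # time-major via permute

print(a[:, 0, 0])   # tensor([0., 2., 4.])  -> not the steps of one original sequence
print(b[:, 0, 0])   # tensor([0., 1., 2.])  -> the original sequence A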

After printing intermediate results at multiple steps, I came to the conclusion that the packed padded sequence is to blame. Does anybody know what could have gone wrong with it?

Do I have to sort the sequences by length?

What padding does the packed sequence expect? My padding is done at the end of the sequence.

The solution for me is to not use packed padded sequences. The behavior of that object is very suspicious.

Now, I pad all the training data and then mask the loss.
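
As a rough sketch of the masking idea (masked_mean and the exact shapes are placeholders, not my exact implementation):

import torch

def masked_mean(loss_per_step: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # loss_per_step and mask have shape (batch_size, sequence_length);
    # mask is 1.0 for real steps and 0.0 for padded steps
    return (loss_per_step * mask).sum() / mask.sum().clamp(min=1)

# e.g. policy_loss = masked_mean(per_step_policy_loss, padding_mask)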