Hello everybody!
I’m implementing a recurrent model to train my RL agent with PPO, and I’m now working out how to arrange my training data into sequences.
After sampling training data from the current policy, I split the data into episodes and then into sequences.
Next I pad the model’s input data (i.e. the agent’s observations) so that all sequences have a fixed length.
The other training data (e.g. log probabilities, advantages, …) used for computing the losses remains unpadded.
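For context, this is roughly how I split episodes into fixed-length chunks and zero-pad the last one (`split_and_pad` is just an illustrative helper; the chunk length of 4 is arbitrary):

```python
import torch

def split_and_pad(episode, seq_len):
    """Split one episode of shape (T, F) into fixed-length chunks,
    zero-padding the final chunk and recording the true lengths."""
    chunks, lengths = [], []
    for start in range(0, episode.size(0), seq_len):
        chunk = episode[start:start + seq_len]
        lengths.append(chunk.size(0))
        if chunk.size(0) < seq_len:
            pad = torch.zeros(seq_len - chunk.size(0), episode.size(1))
            chunk = torch.cat([chunk, pad])
        chunks.append(chunk)
    return torch.stack(chunks), torch.tensor(lengths)

obs = torch.arange(14.).view(7, 2)        # one episode: 7 steps, 2 features
seqs, lens = split_and_pad(obs, seq_len=4)
# 7 steps -> one full chunk of 4 and one chunk of 3 padded up to 4
assert seqs.shape == (2, 4, 2) and lens.tolist() == [4, 3]
```

The recorded `lens` are what I later pass to `pack_padded_sequence` as the actual sequence lengths.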
Once I feed the model with data, the padded input is propagated through a few conv layers. The resulting output is then turned into a packed sequence via pack_padded_sequence(sequences, lengths, enforce_sorted=False), which is fed to a single recurrent layer.
Now I’m unsure whether I have to call the inverse function pad_packed_sequence(packed_lstm_output). By accessing packed_lstm_output.data, I’d get a tensor that no longer contains the zero pads, and its size would match that of the remaining training data. However, this way the agent is not able to solve any RL environment: the entropy stays constant and the loss is very small.
Basically, I’d like to know whether the order of packed_lstm_output.data differs from the original order before packing.
Or is the correct approach to convert all of my training data to packed padded sequences?
Or is there some way to remove the zero pads after calling the inverse function pad_packed_sequence(packed_lstm_output)?
I believe that the order of my model’s output is not the same as the other training data (e.g. log probabilities, advantages, …).
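A minimal self-contained example (toy numbers) showing what I mean: the packed `.data` is time-major across the batch rather than episode-by-episode, while `pad_packed_sequence` plus a mask built from the lengths recovers the original episode-by-episode order:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two padded sequences with actual lengths 3 and 2 (batch_first, feature dim 1).
padded = torch.tensor([[[1.], [2.], [3.]],
                       [[4.], [5.], [0.]]])
lengths = torch.tensor([3, 2])

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
# .data interleaves time steps across the batch: t=0 of both episodes,
# then t=1 of both, then t=2 of the longer one -> [1, 4, 2, 5, 3]
print(packed.data.squeeze(-1))

# pad_packed_sequence restores the padded layout in the original batch order;
# a mask built from the returned lengths drops the pads episode-by-episode.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
mask = torch.arange(unpacked.size(1)).unsqueeze(0) < out_lengths.unsqueeze(1)
print(unpacked[mask].squeeze(-1))  # [1, 2, 3, 4, 5]
```

So the flattened `.data` order ([1, 4, 2, 5, 3]) really does differ from the order of the unpadded training data ([1, 2, 3, 4, 5]).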
# Reshape the flat data to (sequence_length, batch_size, features)
h_shape = tuple(h.size())
h = h.view(sequence_length, h_shape[0] // sequence_length, h_shape[1])
# Convert the padded data to a packed sequence
packed_sequences = pack_padded_sequence(h, actual_sequence_lengths, enforce_sorted=False)
# Initialize the hidden states to zero
hxs = torch.zeros(h_shape[0] // sequence_length, self.hidden_state_size, dtype=torch.float32, device=device, requires_grad=False)
packed_h, hxs = self.gru(packed_sequences, hxs.unsqueeze(0))
h = packed_h.data
# feed further hidden layers until reaching the policy and the value head...
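For reference, here is a self-contained sketch of the unpack-and-mask alternative I’m considering (toy shapes; the names mirror my snippet above but everything here is standalone):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seq_len, batch, feat, hidden = 4, 3, 5, 8
h = torch.randn(seq_len * batch, feat)             # flat conv output (N, F)
actual_sequence_lengths = torch.tensor([4, 2, 3])  # true per-sequence lengths

# Reshape to (T, B, F); this assumes the flat batch is stored time-major.
h = h.view(seq_len, batch, feat)

packed = pack_padded_sequence(h, actual_sequence_lengths, enforce_sorted=False)
gru = nn.GRU(feat, hidden)
hxs = torch.zeros(1, batch, hidden)
packed_out, hxs = gru(packed, hxs)

# Unpack back to (T, B, H); the batch order matches the input again.
out, out_lengths = pad_packed_sequence(packed_out)

# Boolean mask of shape (T, B) that is True at valid (non-padded) steps.
mask = torch.arange(seq_len).unsqueeze(1) < out_lengths.unsqueeze(0)

# Transpose to (B, T, H) first so the selected steps come out ordered
# episode-by-episode, matching the unpadded log-probs/advantages.
flat = out.transpose(0, 1)[mask.transpose(0, 1)]
assert flat.shape == (int(out_lengths.sum()), hidden)  # (9, 8) here
```

The same mask could then index the unpadded losses, so both sides stay aligned.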