LSTM for RL with batching

Hi everyone,

I hope this topic hasn’t already been covered.

I’m currently implementing IMPALA (https://arxiv.org/abs/1802.01561), a deep reinforcement learning method designed to be distributed without suffering from the bottlenecks of A3C (no GPU utilization) or GA3C (instability). It collects trajectories from actors and applies the V-trace algorithm to perform off-policy correction via importance sampling.
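For context, here is a minimal sketch of how I understand the V-trace target computation from the paper (the function name and tensor layout are my own; `discounts` would be `gamma * (1 - done)` so bootstrapping stops at episode ends):

import torch

def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   discounts, bootstrap_value, rho_bar=1.0, c_bar=1.0):
    """All arguments are [T, B] tensors except bootstrap_value, which is [B]
    (the learner's value estimate for the state after the trajectory)."""
    rhos = torch.exp(target_log_probs - behaviour_log_probs)
    clipped_rhos = torch.clamp(rhos, max=rho_bar)
    clipped_cs = torch.clamp(rhos, max=c_bar)

    # V(x_{t+1}), with the bootstrap value appended for the last step
    values_tp1 = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + discounts * values_tp1 - values)

    # Backward recursion from the paper:
    # v_s - V(x_s) = delta_s + gamma_s * c_s * (v_{s+1} - V(x_{s+1}))
    acc = torch.zeros_like(bootstrap_value)
    out = []
    for t in reversed(range(rewards.shape[0])):
        acc = deltas[t] + discounts[t] * clipped_cs[t] * acc
        out.append(acc)
    return torch.stack(list(reversed(out)), dim=0) + values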
It also uses an LSTM, and that’s where I’m struggling! I chose to use same-length trajectories, with a mask that indicates when the LSTM state should be reset. Here is the implementation I went with:

x_out = []

# lstm_hxs is the (h, c) tuple expected by nn.LSTM; done_mask is assumed
# to be [seq_len, batch, 1], holding 1.0 while an episode is running and
# 0.0 at terminal steps, so the multiplication zeroes the state there.
for i in range(seq_len):
    # x[i] is [batch, features]; nn.LSTM expects [seq, batch, features]
    result, lstm_hxs = self.model.lstm(x[i].unsqueeze(0), lstm_hxs)
    # keep the state a tuple; a list breaks the nn.LSTM hidden-state API
    lstm_hxs = tuple(done_mask[i] * state for state in lstm_hxs)
    x_out.append(result.squeeze(0))

x = torch.stack(x_out, dim=0)

I also thought of other approaches, such as using variable-length episodes (resetting the LSTM state to zero once an episode finishes) together with padding, but I don’t know which approach would be best…
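For that padding variant, I imagine something along these lines (a rough sketch; `episodes` is a hypothetical list of [len_i, features] tensors, each starting from a zero LSTM state):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

def run_lstm_on_episodes(lstm, episodes):
    # Pad the variable-length episodes into one [max_len, B, features] batch
    lengths = torch.tensor([ep.shape[0] for ep in episodes])
    padded = pad_sequence(episodes)
    # Packing lets the LSTM skip the padded steps entirely
    packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
    # No initial state passed => it defaults to zeros, i.e. a fresh
    # state at the start of every episode in the batch
    packed_out, _ = lstm(packed)
    out, _ = pad_packed_sequence(packed_out)  # back to [max_len, B, hidden]
    return out, lengths

The downside seems to be that trajectories then have to be cut at episode boundaries, so batch shapes vary between updates.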

Thanks a lot!