I hope this topic hasn't already been covered.
I’m currently implementing IMPALA (https://arxiv.org/abs/1802.01561), a distributed deep reinforcement learning method designed to avoid the bottlenecks of A3C (no GPU use) and GA3C (instability). It collects trajectories and applies the V-trace algorithm, an off-policy correction based on importance sampling.
It also uses an LSTM, and that’s where I’m struggling! I chose to use trajectories of equal length, with a mask that indicates when the LSTM states should be reset. This is the implementation I went with:
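
    x_out = []
    for i in range(seq_len):
        # run one LSTM step on time step i
        result, lstm_hxs = self.model.lstm(x[i], lstm_hxs)
        # zero the recurrent state wherever the mask marks an episode boundary
        lstm_hxs = [(done_mask[:, :, i] * state) for state in lstm_hxs]
        x_out.append(result)
    # stack the per-step outputs along a new leading time dimension
    x = torch.stack(tensors=x_out, dim=0)
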
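To make the masked-reset idea easier to discuss, here is a minimal self-contained sketch of it. It is not the code from my model: it assumes an `nn.LSTMCell`, time-major input of shape `(seq_len, batch, features)`, and a hypothetical `keep_mask` of shape `(seq_len, batch)` that is 1 while an episode continues and 0 at the first step of a new episode. Unlike the snippet above, the mask is applied before each step, so the first observation of a new episode is processed from a zero state; which convention is right depends on how the mask is recorded.

```python
import torch
import torch.nn as nn


def masked_lstm_unroll(cell, x, keep_mask, state):
    """Unroll an LSTMCell over time, zeroing the state where keep_mask is 0.

    x:         (seq_len, batch, input_size)
    keep_mask: (seq_len, batch), 1.0 = keep state, 0.0 = reset before this step
    state:     tuple (h, c), each of shape (batch, hidden_size)
    """
    h, c = state
    outputs = []
    for t in range(x.size(0)):
        m = keep_mask[t].unsqueeze(-1)      # (batch, 1), broadcasts over hidden
        h, c = cell(x[t], (h * m, c * m))   # reset, then take one step
        outputs.append(h)
    return torch.stack(outputs, dim=0), (h, c)


# Toy usage: batch element 0 starts a new episode at time step 3.
torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=8)
x = torch.randn(5, 2, 4)
keep_mask = torch.ones(5, 2)
keep_mask[3, 0] = 0.0
init = (torch.zeros(2, 8), torch.zeros(2, 8))
out, (h_last, c_last) = masked_lstm_unroll(cell, x, keep_mask, init)
```

Because the reset is a multiplication, gradients are also cut at episode boundaries, so nothing flows back from one episode into the previous one.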
I thought of other methods, such as using variable-length episodes (the LSTM state is reset to zero once an episode is finished) with padding, but I don’t know which approach would be best…
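For comparison, a rough sketch of the padding alternative, assuming each episode is stored as its own tensor and fed to an `nn.LSTM` through PyTorch's packed-sequence utilities. The episode lengths and sizes are made up for illustration. Each sequence starts from a zero state by default, and a boolean mask marks which time steps are real so the padded ones can be excluded from the loss.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    pad_sequence,
    pack_padded_sequence,
    pad_packed_sequence,
)

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8)  # time-major by default

# Three hypothetical episodes of different lengths, each (length, features).
episodes = [torch.randn(5, 4), torch.randn(3, 4), torch.randn(2, 4)]
lengths = torch.tensor([ep.size(0) for ep in episodes])

padded = pad_sequence(episodes)  # (max_len, batch, 4), zero-padded
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

# No initial state is passed, so every episode starts from zeros.
packed_out, (h_n, c_n) = lstm(packed)
seq_out, out_lengths = pad_packed_sequence(packed_out)  # (max_len, batch, 8)

# True where a time step belongs to a real episode, False on padding.
valid = torch.arange(seq_out.size(0)).unsqueeze(1) < lengths.unsqueeze(0)
```

The trade-off as I understand it: padding keeps each episode's recurrence separate without a per-step Python loop, but it wastes compute on padded steps and does not carry state across trajectory boundaries the way the fixed-length, masked version can.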
Thanks a lot!