I am trying to apply an LSTM architecture to a Deep Q-learning model, so the agent can perceive the effects of its actions through time. I think my code has a conceptual error, because the agent isn't learning over time.
In this toy example, the agent is a car that has to drive along a road shaped like a sine wave and stay in the middle of it. It can take three actions: go up, go down, or stand still. It gets a positive reward if it stays near the middle, -1 if it drifts away from the middle, and -3 if it hits a wall (hitting a wall clamps its position to the wall, so it can't leave the road). The agent is fed its position and the current timestep of the loop.
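To make the setup concrete, here is a minimal sketch of what such an environment could look like. All names and constants (`SineRoadEnv`, the step size of 0.5, the road width) are assumptions for illustration, not the actual environment from my code:

```python
import math

class SineRoadEnv:
    """Hypothetical sketch of the toy environment: the road centre
    follows a sine wave and the agent must stay near it."""

    def __init__(self, road_width=2.0, length=200):
        self.road_width = road_width   # distance from centre to each wall
        self.length = length           # timesteps per episode
        self.reset()

    def reset(self):
        self.t = 0
        self.pos = 0.0                 # agent's vertical position
        return (self.pos, self.t)      # agent observes position and time

    def step(self, action):
        # actions: 0 = stand still, 1 = go up, 2 = go down
        self.pos += {0: 0.0, 1: 0.5, 2: -0.5}[action]
        centre = math.sin(0.1 * self.t)
        low, high = centre - self.road_width, centre + self.road_width
        if self.pos <= low or self.pos >= high:
            # hitting a wall clamps the position to the wall
            self.pos = min(max(self.pos, low), high)
            reward = -3.0
        elif abs(self.pos - centre) < 0.5:
            reward = 1.0               # near the middle of the road
        else:
            reward = -1.0              # on the road but off-centre
        self.t += 1
        done = self.t >= self.length
        return (self.pos, self.t), reward, done
```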
Here is a link to the code:
To create the agent that interacts with the environment, I developed the code below:
I'm a little bit confused about how to perform backpropagation on the transitions stored in the experience replay. This awesome post from Arthur Juliani on Medium says that I have to store episodes instead of individual timesteps to perform backprop through the experience replay with an LSTM architecture. I've tried to implement this idea in the code below (which is inside the second link in this post):
    def learn(self):
        # Counter over the transitions of the sampled episode
        i = 0
        # Get one full episode from the memory
        episode = self.memory.sample_episode()
        # Length of the graph, for the case where len(self.memory.memory) != 0
        graph_end = len(episode) - 1
        # Loop to perform backprop through time on the sampled episode
        while i < graph_end:
            state, next_state, action, reward = episode[i]
            # Q-value of the action actually taken at this step
            outputs = self.model(state).gather(1, action.unsqueeze(1)).squeeze(1)
            # Max Q-value over actions in the next state (greedy policy Q*);
            # .max(1) returns a (values, indices) tuple, so take [0]
            next_outputs = self.model(next_state).detach().max(1)[0]
            # Discounted target for the Q-value of the action taken
            target = self.gamma * next_outputs + reward
            # TD loss
            td_loss = F.smooth_l1_loss(outputs, target)
            self.optimizer.zero_grad()
            td_loss.backward(retain_graph=True)
            # Optimization step, updating the parameters of the NN
            self.optimizer.step()
            i += 1
        # Detach the LSTM hidden/cell states so gradients don't flow
        # across episodes
        self.model.hx1.detach_()
        self.model.hx2.detach_()
        self.model.cx1.detach_()
        self.model.cx2.detach_()
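For context, the `learn()` method above assumes a memory that stores whole episodes and exposes a `sample_episode()` method. Here is a minimal sketch of what such an episode-level replay buffer could look like; the class name `EpisodeReplayMemory` and the `push()` signature are assumptions for illustration, not my actual implementation:

```python
import random
from collections import deque

class EpisodeReplayMemory:
    """Hypothetical sketch of an episode-level replay buffer: whole
    episodes are stored, so the LSTM can be unrolled over consecutive
    transitions at learning time."""

    def __init__(self, capacity=100):
        self.memory = deque(maxlen=capacity)   # each entry is one full episode
        self.current = []                      # transitions of the ongoing episode

    def push(self, state, next_state, action, reward, done):
        # Append the transition to the episode in progress
        self.current.append((state, next_state, action, reward))
        if done:
            # Episode finished: store it as a single unit and start a new one
            self.memory.append(self.current)
            self.current = []

    def sample_episode(self):
        # Pick one stored episode uniformly at random
        return random.choice(self.memory)
```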
Am I doing something wrong in this TBPTT algorithm? Is there something about the DQN algorithm that I'm not considering in this implementation?