LSTM error while trying to update the hidden state

I am trying to train an LSTM while keeping its hidden state (a stateful LSTM) until a new epoch (episode) starts. However, I get the following error when I do so:

RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.cuda.FloatTensor [705,
25]] is at version 3; expected version 2 instead. Hint: the backtrace
further above shows the operation that failed to compute its gradient.
The variable in question was changed in there or anywhere later. Good luck!

This happens only when I try to keep the hidden state. If, for example, I use this line instead:

dist, _ =, h_out)

and also remove retain_graph=True, everything works fine.

Could anyone help me understand what is going on here and how I can fix it, please?
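For context, a likely explanation is that a hidden state carried over from the previous update still references the previous graph. With retain_graph=True, the next backward() walks back into that old graph, but optimizer.step() has already modified the weights in place, so autograd reports a version mismatch. A common pattern is to detach the hidden state before reusing it, which keeps its values but cuts the link to the old graph. The sketch below is a toy illustration under assumptions, not the model from this post: TinyActor, detach_hidden, and all sizes are hypothetical names and values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyActor(nn.Module):
    """Hypothetical stand-in for an actor: an LSTM followed by a softmax head."""

    def __init__(self, obs_dim=4, hidden_dim=8, n_actions=3):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)
        # Use the last time step to produce action probabilities.
        probs = torch.softmax(self.head(out[:, -1]), dim=-1)
        return probs, hidden


def detach_hidden(hidden):
    """Keep the values of (h, c) but drop their autograd history."""
    if hidden is None:
        return None
    return tuple(h.detach() for h in hidden)


actor = TinyActor()
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

hidden = None  # reset at the start of an episode
losses = []
for step in range(3):
    x = torch.randn(2, 5, 4)             # (batch, seq_len, obs_dim), random toy data
    actions = torch.randint(0, 3, (2,))  # random toy actions

    # Detach before reuse so backward() stops at this update's graph.
    probs, hidden = actor(x, detach_hidden(hidden))
    dist = torch.distributions.Categorical(probs)
    loss = -dist.log_prob(actions).mean()

    optimizer.zero_grad()
    loss.backward()   # no retain_graph=True needed
    optimizer.step()  # in-place weight update is safe: the old graph is not revisited
    losses.append(loss.item())
```

The trade-off is that gradients no longer flow through earlier updates (truncated backpropagation through time); only the state values are carried forward.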

I have my training loop here:

for ep in range(conf.num_episode):
    state = env.reset()
    step = 0

    qnet_agent.hidden = None
    qnet_agent.hidden_2 = None
    while True:
        step += 1
        frames_total += 1

        epsilon = calculate_epsilon(frames_total)

        action, smart_decision = qnet_agent.select_action(state, epsilon)

        new_state, reward, done, info = env.step(action)

        memory.push(state, action, new_state, reward, done)

        state = new_state

        if done:

And here is my optimize function:

def optimize(self):
    # Skip the update until the replay memory holds at least one full batch.
    if len(self.memory) < self.config.batch_size:
        return

    state, action, new_state, reward, done = self.memory.sample(batch_size=self.config.batch_size)

    state = torch.Tensor(np.array(state)).to(device)
    new_state = torch.Tensor(np.array(new_state)).to(device)
    reward = torch.Tensor(reward).to(device)
    action = torch.LongTensor(action).to(device)
    done = torch.Tensor(done).to(device)

    h_out = self.hidden
    dist, self.hidden =, h_out)
    dist = torch.distributions.Categorical(dist)

    advantage = reward + (1 - done) * self.config.gamma * self.critic(new_state).squeeze(1) - self.critic(state).squeeze(1)

    critic_loss = advantage.pow(2).mean()

    actor_loss = -dist.log_prob(action) * advantage.detach()