I am trying to train an LSTM while keeping its hidden state (a stateful LSTM) until a new epoch (episode) starts. However, when I do this I get the following error:
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.cuda.FloatTensor [705,
25]] is at version 3; expected version 2 instead. Hint: the backtrace
further above shows the operation that failed to compute its gradient.
The variable in question was changed in there or anywhere later. Good luck!
This happens only when I try to keep the hidden state. If, for example, I use this line instead:
dist, _ = self.actor(state, h_out)
and also remove retain_graph=True, everything works just fine.
Could anyone help me understand what is going on here and how I can fix it, please?
I have my training loop here:
for ep in range(conf.num_episode):
    state = env.reset()
    step = 0
    qnet_agent.hidden = None
    qnet_agent.hidden_2 = None
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        action, smart_decision = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        memory.push(state, action, new_state, reward, done)
        qnet_agent.optimize()
        state = new_state
        if done:
            steps_total.append(step)
            break
And here is my optimize function:
def optimize(self):
    if len(self.memory) < self.config.batch_size:
        return
    state, action, new_state, reward, done = self.memory.sample(batch_size=self.config.batch_size)

    state = torch.Tensor(np.array(state)).to(device)
    new_state = torch.Tensor(np.array(new_state)).to(device)
    reward = torch.Tensor(reward).to(device)
    action = torch.LongTensor(action).to(device)
    done = torch.Tensor(done).to(device)

    h_out = self.hidden
    dist, self.hidden = self.actor(state, h_out)
    dist = torch.distributions.Categorical(dist)

    advantage = reward + (1 - done) * self.config.gamma * self.critic(new_state).squeeze(1) - self.critic(state).squeeze(1)

    critic_loss = advantage.pow(2).mean()
    self.optimizer_critic.zero_grad()
    critic_loss.backward()
    self.optimizer_critic.step()

    actor_loss = -dist.log_prob(action) * advantage.detach()
    self.optimizer_actor.zero_grad()
    actor_loss.mean().backward(retain_graph=True)
    self.optimizer_actor.step()
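For reference, a likely cause is that the stored hidden state still carries the autograd graph from the previous call to optimize, while optimizer.step() has since updated the weights in place. A common workaround is to keep the hidden state's values but detach it from the graph after each update. Below is a minimal sketch of that idea. It is not the actor from this post: the lstm, head, sizes, and random inputs are hypothetical stand-ins chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the actor: an LSTM followed by a linear head.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 2)
optimizer = torch.optim.SGD(
    list(lstm.parameters()) + list(head.parameters()), lr=0.01
)

hidden = None  # reset to None at the start of each episode
losses = []
for step in range(3):
    # Placeholder batch: (batch, seq_len, features)
    x = torch.randn(5, 1, 4)
    out, hidden = lstm(x, hidden)
    loss = head(out[:, -1]).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()  # retain_graph=True is not needed here
    optimizer.step()

    # Keep the values of (h, c) but cut them off from the old graph,
    # so the next backward pass does not reach weights that
    # optimizer.step() has already modified in place.
    hidden = tuple(h.detach() for h in hidden)
    losses.append(loss.item())
```

In the optimize function above, the equivalent change would presumably be to detach self.hidden right after the actor call (or before reusing it). This truncates backpropagation at the boundary between updates, which may or may not be acceptable depending on how far back the gradients are meant to flow.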