I am trying to train an LSTM while keeping its hidden state (a stateful LSTM) until I start a new epoch (episode). However, I get the following error when I try to do so:
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.cuda.FloatTensor [705,
25]] is at version 3; expected version 2 instead. Hint: the backtrace
further above shows the operation that failed to compute its gradient.
The variable in question was changed in there or anywhere later. Good
luck!
This only happens when I try to keep the hidden state. If, for example, I use this line instead:
dist, _ = self.actor(state, h_out)
and also remove retain_graph=True, everything works just fine.
Could any of you help me understand what is going on here and how I can fix it, please?
I have my training loop here:
for ep in range(conf.num_episode):
    state = env.reset()
    step = 0
    qnet_agent.hidden = None
    qnet_agent.hidden_2 = None
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        action, smart_decision = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        memory.push(state, action, new_state, reward, done)
        qnet_agent.optimize()
        state = new_state
        if done:
            steps_total.append(step)
            break
And here is my optimize function:
def optimize(self):
    if len(self.memory) < self.config.batch_size:
        return
    state, action, new_state, reward, done = self.memory.sample(batch_size=self.config.batch_size)
    state = torch.Tensor(np.array(state)).to(device)
    new_state = torch.Tensor(np.array(new_state)).to(device)
    reward = torch.Tensor(reward).to(device)
    action = torch.LongTensor(action).to(device)
    done = torch.Tensor(done).to(device)
    h_out = self.hidden
    dist, self.hidden = self.actor(state, h_out)
    dist = torch.distributions.Categorical(dist)
    advantage = reward + (1 - done) * self.config.gamma * self.critic(new_state).squeeze(1) - self.critic(state).squeeze(1)
    critic_loss = advantage.pow(2).mean()
    self.optimizer_critic.zero_grad()
    critic_loss.backward()
    self.optimizer_critic.step()
    actor_loss = -dist.log_prob(action) * advantage.detach()
    self.optimizer_actor.zero_grad()
    actor_loss.mean().backward(retain_graph=True)
    self.optimizer_actor.step()
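A likely cause, offered as a hedged sketch rather than a confirmed diagnosis: the self.hidden returned by one optimize() call still references the autograd graph of that forward pass, and optimizer.step() then updates the weights in place. The next backward() runs through the carried-over state into that older graph and finds tensors at a newer version than expected. A common remedy is to detach the hidden state between updates (truncated backpropagation through time), which also removes the need for retain_graph=True. TinyActor, detach_hidden, and every shape below are hypothetical stand-ins for illustration, not the code from this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyActor(nn.Module):
    """Hypothetical stand-in for self.actor: an LSTM with a softmax head."""

    def __init__(self, obs_dim=4, hidden_dim=8, n_actions=3):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)
        # Use the last time step to produce action probabilities.
        probs = torch.softmax(self.head(out[:, -1]), dim=-1)
        return probs, hidden


def detach_hidden(hidden):
    """Keep the values of (h, c) but cut their link to the previous graph."""
    if hidden is None:
        return None
    return tuple(h.detach() for h in hidden)


actor = TinyActor()
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

hidden = None  # reset at the start of an episode
losses = []
for _ in range(3):
    # Random placeholders for a sampled batch (assumed shapes).
    state = torch.randn(5, 1, 4)          # (batch, seq_len, obs_dim)
    action = torch.randint(0, 3, (5,))
    advantage = torch.randn(5)

    probs, hidden = actor(state, hidden)
    dist = torch.distributions.Categorical(probs)
    loss = (-dist.log_prob(action) * advantage).mean()

    optimizer.zero_grad()
    loss.backward()                       # no retain_graph needed
    optimizer.step()

    # Carry the state forward without its history.
    hidden = detach_hidden(hidden)
    losses.append(loss.item())
```

In optimize() above, the equivalent change would be to detach self.hidden right after the actor's forward pass, or just before it is passed back in as h_out. Gradients then stop at the boundary between updates, so each step trains only on the current forward pass.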