I am training a multi-agent reinforcement learning project and I am getting the error "Trying to backward through the graph a second time..."

I am trying to run a multi-agent reinforcement learning project and I am getting the following error:

Traceback (most recent call last):
  File "E:\USER\Desktop\TD3p\V2\main.py", line 162, in <module>
    marl_agents.learn(memory, writer, steps_total)
  File "E:\USER\Desktop\TD3p\V2\matd3.py", line 118, in learn
    self.agents[agent_idx].actor_loss.backward()
  File "E:\anaconda3\envs\pytorch\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

My code is here:

for agent_idx in range(self.n_agents):
    ...
    # critic loss calculation
    self.agents[agent_idx].critic_loss = F.mse_loss(current_Q1.float(), target_Q.float()) + \
                                         F.mse_loss(current_Q2.float(), target_Q.float())

    # critic optimization
    self.agents[agent_idx].critic.optimizer.zero_grad()
    self.agents[agent_idx].critic_loss.backward()
    self.agents[agent_idx].critic.optimizer.step()

    if steps_total % self.freq == 0 and steps_total > 0:
        # actor loss calculation
        self.agents[agent_idx].actor_loss = -T.mean(self.agents[agent_idx].critic.Q1(states, mu))
        # actor optimization
        self.agents[agent_idx].actor.optimizer.zero_grad()
        self.agents[agent_idx].actor_loss.backward()
        self.agents[agent_idx].actor.optimizer.step()
        self.agents[agent_idx].update_network_parameters()

The error happens at the actors' optimization step: self.agents[agent_idx].actor_loss.backward()

Firstly, each agent needs to call this backward() function, so over the iterations it is definitely called multiple times. However, I think each agent calls backward() independently, so I should not need to set retain_graph=True, because, for example, the second agent does not need to access the saved variables of the first agent.

Secondly, the problem only happens in the actor-loss calculations, although the critic losses and the actor losses follow the same order of execution: calculate the loss, then optimize. After the previous agent completes its calculation and optimization, the next agent executes the same code. The critics can call backward() multiple times for optimization, but the actors cannot.

I have detached the actor_loss before backpropagation. The modified code runs, but the loss curves look strange: all critics' losses keep fluctuating, while all actors' losses keep increasing.

The modified code is here:

for agent_idx in range(self.n_agents):
    ...
    # critic loss calculation
    self.agents[agent_idx].critic_loss = F.mse_loss(current_Q1.float(), target_Q.float()) + \
                                         F.mse_loss(current_Q2.float(), target_Q.float())

    # critic optimization
    self.agents[agent_idx].critic.optimizer.zero_grad()
    self.agents[agent_idx].critic_loss.backward()
    self.agents[agent_idx].critic.optimizer.step()

    if steps_total % self.freq == 0 and steps_total > 0:
        # actor loss calculation (detached before backward)
        self.agents[agent_idx].actor_loss = self.agents[agent_idx].critic.Q1(states, mu).detach()
        self.agents[agent_idx].actor_loss.requires_grad = True
        self.agents[agent_idx].actor_loss = -T.mean(self.agents[agent_idx].actor_loss)
        # actor optimization
        self.agents[agent_idx].actor.optimizer.zero_grad()
        self.agents[agent_idx].actor_loss.backward()
        self.agents[agent_idx].actor.optimizer.step()
        self.agents[agent_idx].update_network_parameters()

I would appreciate it if anyone could give me some hints or help me fix this bug.

I think my biggest problem is that I call the backward() function separately for each agent, but it still says that I have backwarded through the graph a second time in the actor part. However, I do similar operations in the critic part, and all the critics work well.

Hi @Emanul_Yang
It's a bit hard to see what is going on without knowing where states and mu come from in the first code snippet.
My guess would be that when you call backward, the parameters used to compute states and/or mu are assigned a gradient and the graph is freed. I'm not sure what you wish to do, so I can't really advise you on how to fix that bug, unfortunately.
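In isolation, the mechanism I have in mind would look something like this (a toy sketch of my guess, not your actual code): a tensor produced by one forward pass feeds several losses, so the second backward() walks the same sub-graph after the first backward() has already freed its buffers.

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
mu = torch.tanh(actor(torch.randn(8, 4)))  # computed once, reused by both losses

loss_agent_1 = -mu.mean()        # stands in for agent 1's actor loss
loss_agent_2 = -(2 * mu).mean()  # stands in for agent 2's actor loss

loss_agent_1.backward()  # frees the intermediate buffers saved by linear/tanh
loss_agent_2.backward()  # RuntimeError: Trying to backward through the graph a second time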

In the second snippet things are a bit clearer:
you basically create a loss that you detach, then make it differentiable again (but its derivative will be 1), and then call backward on that, which does nothing except populate self.agents[agent_idx].actor_loss.grad. In other words, since you have detached your loss from the graph, none of the parameters is involved in its computation anymore.
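A tiny check of what I mean (toy code, not your exact setup): once the value is detached, backward() never reaches the parameters that produced it.

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
q = actor(torch.randn(8, 4))  # stands in for critic.Q1(states, mu)

loss = q.detach()             # cuts the link to every parameter that produced q
loss.requires_grad = True     # the tensor is differentiable again, but the old graph is gone
loss = -loss.mean()
loss.backward()

print(actor.weight.grad)      # None: the actor never receives a gradient from this loss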

This is the critic part of each agent.

class CriticNetwork(nn.Module):
...
    def forward(self, states, actions):
        """
        :param states:
        :param actions:
        :return: result of the network
        """
        x1 = F.relu(self.fc1(T.cat([states, actions], dim=1)))
        x1 = F.relu(self.fc2(x1))
        q1 = self.q1(x1)

        x2 = F.relu(self.fc1(T.cat([states, actions], dim=1)))
        x2 = F.relu(self.fc2(x2))
        q2 = self.q2(x2)
        return q1.squeeze(-1), q2.squeeze(-1)

    def Q1(self, states, actions):
        """
        :param states:
        :param actions:
        :return: result of the network
        """
        x1 = F.relu(self.fc1(T.cat([states, actions], dim=1)))
        x1 = F.relu(self.fc2(x1))
        q1 = self.q1(x1)
        return q1.squeeze(-1)
...

This is the actor part of each agent.

class ActorNetwork(nn.Module):
    def forward(self, states):
        """
        :param states:
        :return: result of the network
        """
        x = F.leaky_relu(self.fc1(states))
        x = F.leaky_relu(self.fc2(x))
        # output range (-1, 1)
        pi = nn.Tanh()(self.pi(x))

        return pi

This is the definition of mu.

        all_agents_new_mu_actions = []
        for agent_idx, agent in enumerate(self.agents):
            # actions according to the regular actor network for the current state
            mu_states = T.tensor(actor_states[agent_idx], dtype=T.float).to(device)
            pi = agent.actor.forward(mu_states)
            all_agents_new_mu_actions.append(pi)

        mu = T.cat([acts for acts in all_agents_new_mu_actions], dim=1)

Yeah, so I guess you should either compute a different mu for each actor (are those actors different neural nets? If yes, I guess each should return its own mu, no? Perhaps I'm missing a piece of the puzzle).

Otherwise you can still call backward(retain_graph=True), but I would do that only if I'm 100% sure that the same parameters need to be included in all the actors' graphs (i.e. do all your actor losses share a set of parameters, or are they supposed to be completely independent?).

Hi, I have added some code. There are two parts to my project: the actor and the critic. The actor decides the actions, while the critic uses actions and states to calculate the state-action value function (Q function), which judges the quality of the actions.
States are just some information about the agents, and mu is the actions of those agents according to the regular actor networks for the current states. So mu is a collection of actions, the outputs of the actor networks, and for each agent I just want to use its partial derivative to update its own actor network.

The problem is that, for each agent, I can calculate the actor_loss, but when updating, all actor networks will change, because mu contains the actions of all the actor networks.

Ok, now I get it.
You should probably not concatenate your pis :slight_smile: and only use the one that is relevant to the current actor (if that makes sense).
Otherwise you can simply use retain_graph=True, that should work.
One comment: you can simply call T.cat(all_agents_new_mu_actions, dim=1), there is no need to build a new list out of the list first.

Each agent has independent networks (an actor network and a critic network), and if I only use the action that is relevant to the current actor in the calculation, the whole algorithm does not converge.
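Maybe something in between could work: keep the joint action input to the critic, but let only the current agent's action carry gradient. Here is a rough sketch of that idea, reusing the variable names from my earlier snippets (I have not tested it):

# rough sketch, not tested: per-agent mu where only the current agent's
# action stays attached to the graph, all other actions are detached
for agent_idx in range(self.n_agents):
    per_agent_actions = []
    for other_idx, other_agent in enumerate(self.agents):
        mu_states = T.tensor(actor_states[other_idx], dtype=T.float).to(device)
        pi = other_agent.actor.forward(mu_states)
        # keep the graph only for the agent that is being updated
        per_agent_actions.append(pi if other_idx == agent_idx else pi.detach())
    mu_i = T.cat(per_agent_actions, dim=1)

    actor_loss = -T.mean(self.agents[agent_idx].critic.Q1(states, mu_i))
    self.agents[agent_idx].actor.optimizer.zero_grad()
    actor_loss.backward()  # only this agent's actor receives gradients
    self.agents[agent_idx].actor.optimizer.step()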

I have tried calling backward(retain_graph=True), but when the optimizer actor.optimizer.step() is used, all agents' networks update their weights. So it is OK for the first agent to update. However, for the second agent another problem happens: one of the variables needed for gradient computation has been modified by an inplace operation.
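I think this is the same failure in isolation (a toy sketch, just my guess at the cause): once step() modifies the weights in place, a later backward() through a graph that saved those weights fails.

import torch
import torch.nn as nn

net = nn.Linear(2, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

out = net(torch.randn(4, 2)).sum()  # one forward pass shared by two losses
loss1, loss2 = out, 2 * out

loss1.backward(retain_graph=True)
opt.step()        # modifies net.weight in place
loss2.backward()  # RuntimeError: one of the variables needed for gradient
                  # computation has been modified by an inplace operation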

you could do

if steps_total % self.freq == 0 and steps_total > 0:
    for agent_idx in range(self.n_agents):
        # actor loss and gradient calculation
        self.agents[agent_idx].actor_loss = self.agents[agent_idx].critic.Q1(states, mu)
        self.agents[agent_idx].actor_loss = -T.mean(self.agents[agent_idx].actor_loss)
        self.agents[agent_idx].actor_loss.backward(retain_graph=True)

    for agent_idx in range(self.n_agents):
        self.agents[agent_idx].actor.optimizer.step()
    for agent_idx in range(self.n_agents):
        self.agents[agent_idx].actor.optimizer.zero_grad()

That way all the gradients are gathered first, then the steps are made, and the graphs all hold for as long as they are needed.

I have modified it and run it, but I am not sure about this solution, because normally we call optimizer.zero_grad(), then .backward(), and finally optimizer.step().

Is the following version more reasonable?

if steps_total % self.freq == 0 and steps_total > 0:
    for agent_idx in range(self.n_agents):
        self.agents[agent_idx].actor.optimizer.zero_grad()
    for agent_idx in range(self.n_agents):
        # actor loss and gradient calculation
        self.agents[agent_idx].actor_loss = self.agents[agent_idx].critic.Q1(states, mu)
        self.agents[agent_idx].actor_loss = -T.mean(self.agents[agent_idx].actor_loss)
        self.agents[agent_idx].actor_loss.backward(retain_graph=True)
    for agent_idx in range(self.n_agents):
        self.agents[agent_idx].actor.optimizer.step()

You can call zero_grad before backward or after step; I don't see any difference.
Personally I prefer after step, it makes more sense to me.

I have almost the same problem; I hope you can help me.
I have a number N of agents, and each agent is independent. This is the old code,
and it gives this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.

all_agents = []

all_agents.append (Agent (actor_dims, critic_dims))

for agent_idx, agent in enumerate (all_agents):
    i = agent.agent_label
    critic_value_ = agent.target_critic.forward (states_[i], new_actions_cluster[i]).flatten ()

    critic_value = agent.critic.forward (states[i], old_actions_cluster[i]).flatten ()

    target = rewards[:, agent_idx] + agent.gamma * critic_value_

    critic_loss= F.mse_loss (critic_value.float (), target.float ())

    agent.critic.optimizer.zero_grad ()
    critic_loss.backward (retain_graph=True)

    actor_loss = agent.critic.forward (states[i], mu_cluster[i]).flatten ()
    actor_loss = -(T.mean (actor_loss))

    agent.actor.optimizer.zero_grad ()
    actor_loss.backward ()

    agent.critic.optimizer.step ()
    agent.actor.optimizer.step ()

Then I modified it as you mentioned here, as follows. It works but it is slow, and I'm not sure it is correct. Is this correct or not?

all_agents = []

all_agents.append (Agent (actor_dims, critic_dims))

for agent_idx, agent in enumerate (all_agents):
    i = agent.agent_label
    critic_value_ = agent.target_critic.forward (states_[i], new_actions_cluster[i]).flatten ()

    critic_value = agent.critic.forward (states[i], old_actions_cluster[i]).flatten ()

    target = rewards[:, agent_idx] + agent.gamma * critic_value_

    agent.critic_loss= F.mse_loss (critic_value.float (), target.float ())
    agent.critic_loss.backward (retain_graph=True)

for agent_idx, agent in enumerate (all_agents):
    agent.critic.optimizer.zero_grad ()
for agent_idx, agent in enumerate (all_agents):
    agent.critic.optimizer.zero_grad ()

for agent_idx, agent in enumerate (all_agents):
    i = agent.agent_label
    agent.actor_loss = agent.critic.forward (states[i], mu_cluster[i], typ).flatten ()
    agent.actor_loss = -T.mean (agent.actor_loss)
    agent.actor_loss.backward (retain_graph=True)

for agent_idx, agent in enumerate (all_agents):
    agent.actor.optimizer.step ()
    # agent.actor.optimizer.zero_grad ()

for agent_idx, agent in enumerate (all_agents):
    agent.actor.optimizer.zero_grad ()

Also, I think I should remove retain_graph=True, but when I remove it, it gives another error :slight_smile:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

Hi @TMRajeh
A couple of things:

  1. You do a backward and then zero_grad, so if I'm not wrong you should end up with no grad at all (see the small check after this list).
  2. I don't understand why the two losses affect the critic network. Also, it seems that when doing zero_grad you apply it to the same network twice in a row.
  3. Related to this: where do you use agent.actor for the actor loss? It should be in the actor loss loop, no?
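A quick way to check point 1 (a toy sketch, not your code):

import torch
import torch.nn as nn

net = nn.Linear(2, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

loss = net(torch.randn(4, 2)).sum()
loss.backward()   # gradients are populated here
opt.zero_grad()   # ...and wiped again before step() can use them

print(net.weight.grad)  # all zeros (or None on recent PyTorch versions)
opt.step()              # effectively a no-op update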