Backward error even though actor and critic are two different networks (PPO implementation)

I have seen the existing topics discussing this error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

but I think my case is different. I have implemented a PPO algorithm where the actor and the critic are two completely separate networks, so I call backward on the actor loss for the actor network and on the critic loss for the critic network, as shown below:

# critic loss backward implementation
self.critic_optimizer.zero_grad()
critic_loss.backward()
nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
self.critic_optimizer.step()

# actor loss backward implementation
self.actor_optimizer.zero_grad()
actor_loss.backward() ######### ERROR ARISES HERE
nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
self.actor_optimizer.step()

The thing is that although these networks are different, the error occurs when I call backward on the actor loss (the second backward call in my code), which should not be related to the backward pass of the critic loss. Is there a way for PyTorch to understand that the actor's backward pass is not related to the critic's?

Please show your whole implementation

This error can occur for multiple reasons. I once observed it when my input tensor had requires_grad=True, so one thing to check is the input tensor.
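For example, a quick check along these lines (hypothetical tensors, just to illustrate the idea):

import torch

# Hypothetical example: tensors fed into the networks (states, actions, returns)
# should be plain data with no autograd history attached.
state = torch.randn(4, 8, requires_grad=True)   # suspicious for a network input
print(state.requires_grad)                      # True -> autograd will track it
state = state.detach()                          # cut it out of any graph
print(state.requires_grad)                      # False -> safe to reuse across updates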

In this code block I specify the two networks:

import torch.nn as nn
from torch.distributions import Categorical

# layer_init is a weight-initialization helper defined elsewhere (not shown here)

class Actor(nn.Module):
    def __init__(self, obs_size, action_size, hidden_size,
                 activation=nn.Tanh()):
        super(Actor, self).__init__()

        self.action = nn.Sequential(
            layer_init(nn.Linear(obs_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, action_size), std=0.01),
        )

    def forward(self):
        raise NotImplementedError

    def get_action(self, state, action = None):
        logits = self.action(state)
        probs = Categorical(logits=logits)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action), probs.entropy()


class Critic(nn.Module):
    def __init__(self, obs_size: int, hidden_size, activation=nn.Tanh()):
        """Initialize."""
        super(Critic, self).__init__()

        self.value = nn.Sequential(
            layer_init(nn.Linear(obs_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, 1), std=1.)
        )

    def forward(self):
        raise NotImplementedError

    def get_value(self, state):
        return self.value(state)

instantiating the networks:

self.actor = Actor(self.obs_size, self.action_size, self.hidden_size).to(self.device)
self.critic = Critic(self.obs_size, self.hidden_size).to(self.device)

optimizers:

self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.actor_lr)
self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=self.critic_lr)

the network update function:

    def update_model(self, next_state: torch.Tensor):

        last_value = self.critic.get_value(next_state.to(self.device)).reshape(1, -1)

        returns, advantages = compute_gae(last_value, self.rewards, self.masks, self.values,
                                          self.gamma, self.lam, self.device)

        # squeeze
        states_traj = self.states.squeeze()
        log_probs_traj = self.log_probs.squeeze()
        actions_traj = self.actions.squeeze()
        advantages_traj = advantages.squeeze()
        returns_traj = returns.squeeze()
        values_traj = self.values.squeeze()

        
        ids = np.arange(self.trajectory_size)
        for epoch in range(self.epochs):
            np.random.shuffle(ids)
            for start in range(0, self.trajectory_size, self.mini_batch_size):
                end = start + self.mini_batch_size
                minibatch_ind = ids[start:end]
                advantages_minib = advantages_traj[minibatch_ind]
                if self.normalize_adv:
                    advantages_minib = (advantages_minib - advantages_minib.mean()) / (advantages_minib.std() + 1e-8)

                _, newlogproba, entropy = self.actor.get_action(states_traj[minibatch_ind],
                                                                actions_traj.long()[minibatch_ind])
                ratio = (newlogproba - log_probs_traj[minibatch_ind]).exp()

                # actor loss
                surr_loss = -advantages_minib * ratio
                clipped_surr_loss = -advantages_minib * torch.clamp(ratio, 1 - self.epsilon,
                                                                    1 + self.epsilon)
                actor_loss_max = torch.max(surr_loss, clipped_surr_loss).mean()
                entropy_loss = entropy.mean()
                actor_loss = actor_loss_max - self.entropy_weight * entropy_loss

                # critic_loss
                new_values = self.critic.get_value(states_traj[minibatch_ind]).view(-1)
                if self.clipped_value_loss:
                    critic_loss_unclipped = (new_values - returns_traj[minibatch_ind]) ** 2
                    value_clipped = values_traj[minibatch_ind] + torch.clamp(new_values -
                                                                             values_traj[minibatch_ind],
                                                                             - self.epsilon, self.epsilon)
                    critic_loss_clipped = (value_clipped - returns_traj[minibatch_ind]) ** 2
                    critic_loss_max = torch.max(critic_loss_clipped, critic_loss_unclipped)
                    critic_loss = 0.5 * critic_loss_max.mean() * self.critic_weight
                else:
                    critic_loss = 0.5 * ((new_values - returns_traj[minibatch_ind]) ** 2).mean() * self.critic_weight

                loss = actor_loss + critic_loss

                # critic backward implementation
                self.critic_optimizer.zero_grad()
                critic_loss.backward()
                nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.critic_optimizer.step()

                # actor backward implementation
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.actor_optimizer.step()

        return actor_loss, critic_loss

You mean the input of the forward pass of a network?

You can add this line:

advantages_minib = advantages_minib.detach()

after:

if self.normalize_adv:
    advantages_minib = (advantages_minib - advantages_minib.mean()) / (advantages_minib.std() + 1e-8)

It worked for the first mini-batch, as both backward calls completed successfully, but with the second mini-batch I got the same error for the critic's backward call. So I added

critic_loss.backward(retain_graph=True)
actor_loss.backward(retain_graph=True)

But I got this error after the second pass:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 1]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

No, you should not add retain_graph=True, because in PPO the actor does not need gradients from the critic.
You should probably also detach:

returns_traj[minibatch_ind]

Basically, your actor gets gradients through the entropy and your critic gets gradients from the value loss. Anything other than these two should also be detached.
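Something like this, reusing the variable names from your code above (the new left-hand-side names are only for illustration):

# Detach everything that comes from the rollout, so that only the current
# forward passes of the actor and critic build a computation graph.
advantages_minib = advantages_minib.detach()
returns_minib    = returns_traj[minibatch_ind].detach()    # used in the critic loss
old_log_probs    = log_probs_traj[minibatch_ind].detach()  # used in the ratio
old_values       = values_traj[minibatch_ind].detach()     # used in value clipping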

No, you should not add retain_graph=True, because in PPO the actor does not need gradients from the critic.

So whenever we call backward with retain_graph=True, does it affect the next backward call, no matter which network we applied it to initially?

Thank you so much for the help! It runs perfectly right now!

I would really appreciate it if you have time to explain to me why returns_traj[minibatch_ind].detach() and advantages_minib = advantages_minib.detach() played such an important role in this error.

retain_graph basically keeps the forward (and backward) graph alive after the backward call. However, torch saves intermediate tensors during some forward calls and uses them in the corresponding backward pass. Their version is recorded, so any in-place update of these saved tensors is noticed by torch, and an error is thrown to notify the user. This mechanism also applies to parameters; parameters and buffers are basically tensors with special wrappers.

When you call step(), the optimizer performs in-place parameter updates, so their version numbers are incremented, which is why you got this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 1]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
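Here is a tiny standalone example (unrelated to your PPO code) that reproduces the same kind of error by calling step() between two backward passes over one retained graph:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

out = net(torch.randn(4, 2))        # single forward pass, graph reused twice below
loss1 = out.mean()
loss2 = out.pow(2).mean()

loss1.backward(retain_graph=True)
opt.step()                          # in-place update bumps the parameters' versions
loss2.backward()                    # RuntimeError: ... modified by an inplace operation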

“retain_graph” is mainly used to perform multiple backward passes on a network with multiple outputs (e.g. a multi-head output) before a final call to step().
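For instance, a hypothetical sketch of that use case (shared trunk, two heads; all names made up for illustration):

import torch
import torch.nn as nn

trunk = nn.Linear(8, 16)
head_a = nn.Linear(16, 1)
head_b = nn.Linear(16, 1)
params = list(trunk.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

shared = torch.tanh(trunk(torch.randn(32, 8)))     # one forward pass through the trunk

optimizer.zero_grad()
head_a(shared).mean().backward(retain_graph=True)  # keep the trunk's graph alive
head_b(shared).mean().backward()                   # second pass through the same graph
optimizer.step()                                   # single in-place update at the end

(Summing the two losses and calling backward once would be equivalent here; retain_graph just makes the two separate backward calls possible.)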

detach() stops gradients from flowing through that path, so you will not get those errors.

Okay, now I see what the point of retain_graph is. Thank you for the explanation.

You also said:

Basically, your actor gets gradients through the entropy and your critic gets gradients from the value loss.

This means that the actor network is updated through the actor loss, which consists of the surrogate loss (clipped or unclipped) minus the weighted entropy bonus. We thus update the actor's weights by backpropagating the gradients of this loss in order to change the policy distribution.

As for the critic, since we detached the returns, only the gradients of the values predicted by the critic network backpropagate and update its weights.

Are the above statements true?

Correct.

And the actor will also receive gradients from the entropy term if you add the entropy loss to the actor loss.

You know, even with a good understanding of an algorithm, implementation details like these are difficult to recognize from only reading the paper and the pseudocode.

Again, thank you very much! @iffiX

:smile: Glad I could help