Backward error, although there are two different networks for actor and critic. (PPO implementation)

Kimonili · August 16, 2020, 1:46pm

I have seen the topics discussed about this error,

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

but I think in my case is different. I have implemented a PPO algorithm where the actor and the critic are two completely different networks, and so I backward the actor loss for the actor network, and the critic loss for the critic network, as shown below:

# critic loss backward implementation
self.critic_optimizer.zero_grad()
critic_loss.backward()
nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
self.critic_optimizer.step()

# actor loss backward implementation
self.actor_optimizer.zero_grad()
actor_loss.backward() ######### ERROR ARISES HERE
nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
self.actor_optimizer.step()

The thing is that although these networks are different, the error occurs when I call the backward of the actor loss (the second backward call in my code), which is not related to the backward of the critic loss. I there a way for pytorch to understand that the actor’s backward is not related with the critic’s backward?

iffiX · August 16, 2020, 1:54pm

Please show your whole implementation

vmirly1 · August 16, 2020, 1:57pm

This error could occur for multiple reasons. One time I observed this when my input tensor had requires_grad=True, so one thing to check is the input tensor.

Kimonili · August 16, 2020, 2:03pm

In this code block I specify the two networks:

class Actor(nn.Module):
    def __init__(self, obs_size, action_size, hidden_size,
                 activation=nn.Tanh()):
        super(Actor, self).__init__()

        self.action = nn.Sequential(
            layer_init(nn.Linear(obs_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, action_size), std=0.01),
        )

    def forward(self):
        raise NotImplementedError

    def get_action(self, state, action = None):
        logits = self.action(state)
        probs = Categorical(logits=logits)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action), probs.entropy()


class Critic(nn.Module):
    def __init__(self, obs_size: int, hidden_size, activation=nn.Tanh()):
        """Initialize."""
        super(Critic, self).__init__()

        self.value = nn.Sequential(
            layer_init(nn.Linear(obs_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, hidden_size)),
            activation,
            layer_init(nn.Linear(hidden_size, 1), std=1.)
        )

    def forward(self):
        raise NotImplementedError

    def get_value(self, state):
        return self.value(state)

the networks call:

self.actor = Actor(self.obs_size, self.action_size, self.hidden_size).to(self.device)
self.critic = Critic(self.obs_size, self.hidden_size).to(self.device)

optimizers:

self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.actor_lr)
self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=self.critic_lr)

update of the networks function:

    def update_model(self, next_state: np.ndarray):
     
        last_value = self.critic.get_value(next_state.to(self.device)).reshape(1, -1)

        returns, advantages = compute_gae(last_value, self.rewards, self.masks, self.values,
                                          self.gamma, self.lam, self.device)

        # squeeze
        states_traj = self.states.squeeze()
        log_probs_traj = self.log_probs.squeeze()
        actions_traj = self.actions.squeeze()
        advantages_traj = advantages.squeeze()
        returns_traj = returns.squeeze()
        values_traj = self.values.squeeze()

        
        ids = np.arange(self.trajectory_size)
        for epoch in range(self.epochs):
            np.random.shuffle(ids)
            for start in range(0, self.trajectory_size, self.mini_batch_size):
                end = start + self.mini_batch_size
                minibatch_ind = ids[start:end]
                advantages_minib = advantages_traj[minibatch_ind]
                if self.normalize_adv:
                    advantages_minib = (advantages_minib - advantages_minib.mean()) / (advantages_minib.std() + 1e-8)

                _, newlogproba, entropy = self.actor.get_action(states_traj[minibatch_ind],
                                                                actions_traj.long()[minibatch_ind])
                ratio = (newlogproba - log_probs_traj[minibatch_ind]).exp()

                # actor loss
                surr_loss = -advantages_minib * ratio
                clipped_surr_loss = -advantages_minib * torch.clamp(ratio, 1 - self.epsilon,
                                                                    1 + self.epsilon)
                actor_loss_max = torch.max(surr_loss, clipped_surr_loss).mean()
                entropy_loss = entropy.mean()
                actor_loss = actor_loss_max - self.entropy_weight * entropy_loss

                # critic_loss
                new_values = self.critic.get_value(states_traj[minibatch_ind]).view(-1)
                if self.clipped_value_loss:
                    critic_loss_unclipped = (new_values - returns_traj[minibatch_ind]) ** 2
                    value_clipped = values_traj[minibatch_ind] + torch.clamp(new_values -
                                                                             values_traj[minibatch_ind],
                                                                             - self.epsilon, self.epsilon)
                    critic_loss_clipped = (value_clipped - returns_traj[minibatch_ind]) ** 2
                    critic_loss_max = torch.max(critic_loss_clipped, critic_loss_unclipped)
                    critic_loss = 0.5 * critic_loss_max.mean() * self.critic_weight
                else:
                    critic_loss = 0.5 * (new_values - returns_traj[minibatch_ind] ** 2).mean() * self.critic_weight

                loss = actor_loss + critic_loss

                # critic backward implementation
                self.critic_optimizer.zero_grad()
                critic_loss.backward()
                nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.critic_optimizer.step()

                # actor backward implementation
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.actor_optimizer.step()

        return actor_loss, critic_loss

Kimonili · August 16, 2020, 2:05pm

You mean the input of the forward pass of a network?

iffiX · August 16, 2020, 2:14pm

You can add this line:

advantages_minib = advantages_minib.detach()

after:

if self.normalize_adv:
    advantages_minib = (advantages_minib - advantages_minib.mean()) / (advantages_minib.std() + 1e-8)

Kimonili · August 16, 2020, 2:47pm

It worked for the first mini batch, as both backwards were made succesfully but when I backwarded with the second mini batch, I got the same error for the critic backward function. So I added

critic_loss.backward(retain_graph=True)

actor_loss.backward(retain_graph=True)

But got this error after the second pass:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 1]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

iffiX · August 16, 2020, 2:51pm

No, you should not add “retain_graph=True” because in PPO actor does not need gradient from critic.
you probably should also detach:

returns_traj[minibatch_ind]

basically you actor gets gradients through entropy and your critic gets gradients from value loss. Anything other than these two tensors should also be detached.

Kimonili · August 16, 2020, 3:01pm

No, you should not add “retain_graph=True” because in PPO actor does not need gradient from critic.

So whenever we call retain_graph=True it applies to the next backward call, no matter to which network we applied it initially?

Thank you so much for the help! It runs perfectly right now!

I would really appreciate it if you have time to explain to me why returns_traj[minibatch_ind].detach() and advantages_minib = advantages_minib.detach() played so important role in this error.

iffiX · August 16, 2020, 3:09pm

retain_graph basically keeps the forward(or backward) path after the backward call, however, torch will use buffers to store temporary data in some forward calls, and these data are utilized in their respective backward pass, their version will be stored so that any update on these buffers will be noticed by torch and an error will be thrown to notify users of this. This mechanism also applies to parameters. Parameters and Buffers are basically tensors with special wraps.

when you call “step”, the optimizer will perform parameter updates, thus their version number will be increased, therefore you got this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 1]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

“retain_graph” is mainly used to perform multiple backward pass on networks with multiple outputs (i.e.: multi-head output) before a final call to step().

Detach stops gradient from flowing through that path, so you will not get those errors.

Kimonili · August 16, 2020, 3:26pm

Okay now I see whats the point of retain_graph. Thank you for the explanation.

You also said:

basically you actor gets gradients through entropy and your critic gets gradients from value loss.

This means that the actor network is being updated through actor loss, which consists of the surrogate loss (clipped or unclipped) discounted by the entropy. We thus update the weights of the actor based on backpropagatiion of the entropy gradeints in order to change the policy distribution.

As for the critic, we detached the returns. And so the gradients of the predicted values from the critic network are going to backpropagate and update the weights.

Are the above statements true?

iffiX · August 16, 2020, 3:28pm

correct.

And actor will also receive gradients from entropy if you are adding entropy loss to actor loss.

Kimonili · August 16, 2020, 3:31pm

You know even if you have a good understanding of an algorithm, its code execution and details like these, are difficult to be recognized by only reading the paper and pseudocode.

Again, thank you very much! @iffiX

iffiX · August 16, 2020, 3:32pm

Glad I could help