Question About Whether PPO Training Will Work

Hi. I am relatively inexperienced with PyTorch, so I apologize if this question seems silly.
I am working with some friends on implementing PPO from scratch in PyTorch, and part of what I have to do is train the actor and critic networks. From what I understand, the actor's loss should be the output of the critic. I have attached my current code below.
I have read that PyTorch builds a computational graph of all the operations done to/with a tensor, and .backward() uses this graph to calculate the gradients and store them in the tensors involved. Then .step() uses those stored gradients to update the parameters that were passed to the optimizer (in this case actor.parameters()).
However, since I am training both the actor and the critic, and am using the critic to calculate the actor's loss, would this cause a conflict? Would the critic accumulate the gradients from the actor's training as well, interfering with the critic's training?
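
To check that I am describing that mechanism correctly, here is a tiny standalone sketch I wrote with toy tensors (nothing to do with PPO; my actual training loop follows right after):

import torch

# one learnable parameter and a plain SGD optimizer over it
w = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([w], lr=0.1)

loss = (w * 3.0).sum()  # builds a computation graph ending at loss
loss.backward()         # walks the graph and stores d(loss)/dw in w.grad
print(w.grad)           # tensor([3.])

opt.step()              # uses the stored w.grad to update w: 2.0 - 0.1 * 3.0
opt.zero_grad()         # clears w.grad so the next backward() starts fresh
print(w)                # Parameter containing: tensor([1.7000], requires_grad=True)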

import torch

def train(actor, critic, env):
    criterion = torch.nn.MSELoss(reduction='sum')
    # SGD needs an explicit learning rate; 0.01 here is just a placeholder
    actor_optimizer = torch.optim.SGD(actor.parameters(), lr=0.01)
    critic_optimizer = torch.optim.SGD(critic.parameters(), lr=0.01)
    for batch in range(1000):
        env.reset()
        the_action = env.action_space.sample()
        for t in range(1000):
            state, act, reward = get_state(env, the_action)
            critic_pred = critic(the_action)
            critic_loss = criterion(critic_pred, reward)
            actor_loss = critic(the_action)  # Here

            # Here
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            critic_optimizer.zero_grad()
            critic_loss.backward()
            critic_optimizer.step()

            the_action = actor(state)

Hi Oliver (and Rohan)!

Without commenting on the entirety of Rohan’s code, your analysis is not correct.

This part of your analysis is true – gradients from the call to actor_loss.backward() will be accumulated into critic.

However, the conclusion that this interferes with the critic's training is incorrect,
because Rohan calls critic_optimizer.zero_grad() before calling critic_loss.backward().
Any “contamination” from the call to actor_loss.backward() gets zeroed out before
critic’s gradients get correctly populated by the call to critic_loss.backward().
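
Here is a minimal, self-contained sketch of that ordering (toy Linear modules standing in for the actor and critic, not Rohan’s actual networks):

import torch

actor = torch.nn.Linear(4, 2)
critic = torch.nn.Linear(2, 1)
critic_optimizer = torch.optim.SGD(critic.parameters(), lr=0.01)

state = torch.randn(1, 4)
action = actor(state)

# actor_loss flows through critic, so backward() accumulates gradients
# into critic's parameters as well as actor's
actor_loss = critic(action).sum()
actor_loss.backward()
print(critic.weight.grad)    # non-zero – the "contamination"

# zeroing critic's gradients before critic's own backward pass
# discards that contamination
critic_optimizer.zero_grad()
print(critic.weight.grad)    # None (or zeros, depending on the pytorch version)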

One last comment:

Rohan creates two copies of critic’s computation graph – once
with critic_pred = critic(the_action) and then again with
actor_loss = critic(the_action).

This matters because (as written) the call to actor_loss.backward() releases the
actor_loss copy of critic’s computation graph. But that’s okay, because the call
to critic_loss.backward() uses the critic_pred copy of the computation graph
and it hasn’t been released yet.
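
Again as a toy illustration of the two-copies point (a stand-in Linear module, not Rohan’s code):

import torch

critic = torch.nn.Linear(2, 1)
the_action = torch.randn(1, 2, requires_grad=True)

critic_pred = critic(the_action)         # first copy of critic's graph
actor_loss = critic(the_action).sum()    # second, independent copy

actor_loss.backward()                    # releases only the actor_loss copy

critic_loss = (critic_pred - 1.0).pow(2).sum()
critic_loss.backward()                   # still works – critic_pred's copy of
                                         # the graph has not been released

(Had actor_loss been built from the same forward pass as critic_pred, the second
call to .backward() would instead raise the familiar “Trying to backward through
the graph a second time” error unless retain_graph = True were passed.)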

So, as it stands, I think (this part of) Rohan’s code should work.

Best.

K. Frank
