Hi. I am relatively inexperienced with PyTorch, so I apologize if this question seems silly.
I am working with some friends on implementing PPO from scratch in PyTorch, and my part is training the actor and critic networks. From what I understand, the actor's cost function should come from the output of the critic. I have attached my current code below.
I have read that PyTorch builds a computational graph of all the operations performed on a tensor, and that .backward() walks this graph to compute gradients and store them in the .grad attribute of each parameter. Then optimizer.step() uses those stored gradients to update whichever parameters were passed to the optimizer (in this case actor.parameters()).
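To check that I understand the mechanics, here is a tiny standalone example I wrote (a made-up linear model and random data, not part of my PPO code), which seems to behave the way I described:

import torch

# toy model and made-up data, just to see where the gradients end up
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), target)

optimizer.zero_grad()           # clear gradients left over from a previous step
loss.backward()                 # fills model.weight.grad and model.bias.grad
print(model.weight.grad.shape)  # torch.Size([1, 4])
optimizer.step()                # updates only the parameters this optimizer was given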
However, since I am training both the actor and the critic, and I am using the critic to compute the actor's loss, would this cause a conflict? Would the critic's parameters also accumulate gradients from the actor's backward pass, interfering with the critic's training? (I put a small sketch of what I mean below the code.)
import torch

def train(actor, critic, env):
    criterion = torch.nn.MSELoss(reduction='sum')
    # explicit learning rates so the optimizers can be constructed
    actor_optimizer = torch.optim.SGD(actor.parameters(), lr=1e-3)
    critic_optimizer = torch.optim.SGD(critic.parameters(), lr=1e-3)

    for batch in range(1000):
        env.reset()
        the_action = env.action_space.sample()

        for t in range(1000):
            state, act, reward = get_state(env, the_action)

            # train the critic to predict the observed reward
            critic_pred = critic(the_action)
            critic_loss = criterion(critic_pred, reward)  # MSELoss expects (prediction, target)

            # Here: the actor's loss is just the critic's output -- the part I'm unsure about
            actor_loss = critic(the_action)

            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            critic_optimizer.zero_grad()
            critic_loss.backward()
            critic_optimizer.step()

            the_action = actor(state)
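To make the "interfering" part concrete, this is roughly the check I have in mind (made-up one-layer networks standing in for my real actor and critic):

import torch

# stand-ins for my actor and critic, just to inspect gradients
actor = torch.nn.Linear(4, 2)
critic = torch.nn.Linear(2, 1)

state = torch.randn(1, 4)
action = actor(state)
actor_loss = critic(action).sum()  # .sum() only to get a scalar for backward()

actor_loss.backward()

# after one backward pass through the critic, both networks have gradients
print(actor.weight.grad is not None)   # True
print(critic.weight.grad is not None)  # True -- this is what I am worried about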