I am working through the PPO sample code from the tutorial linked below.
https://pytorch.org/rl/tutorials/coding_ppo.html
I want to save a trained PPO model in a reproducible way, since I will be running multiple tests against it. I know that a model can generally be saved with torch.save. However, PPO uses multiple networks. Which network(s) do I need to save?
At least, saving only the policy network was not enough to ensure reproducibility.
I changed the policy-evaluation part of the PPO code as follows, but when I load the saved policy network, it does not reproduce the values stored in the logs variable:
with set_exploration_mode("mean"), torch.no_grad():
    # execute a rollout with the trained policy
    torch.save(
        policy_module.state_dict(), "saved_policy_network"
    )  # save the current policy network (but a run that loads this does not reproduce the logs saved below)
    eval_rollout = env.rollout(1000, policy_module)
    env.transform.dump()
    logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
    logs["eval reward (sum)"].append(
        eval_rollout["next", "reward"].sum().item()
    )
    logs["eval step_count"].append(eval_rollout["step_count"].max().item())
    eval_str = (
        f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
        f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
        f"eval step-count: {logs['eval step_count'][-1]}"
    )
    del eval_rollout
    np.save("saved_logs", np.array(dict(logs)))  # save the current logs