How to save a trained model in a PPO sample

I am working through the PPO sample code from the tutorial linked below.
https://pytorch.org/rl/tutorials/coding_ppo.html

I want to save a trained PPO model in a way that is reproducible, because I will be running multiple tests against it. I know that saving a model can generally be done with torch.save, but PPO uses several networks. Which of them do I need to save?

At least, saving only the policy network was not enough to ensure reproducibility.
I changed the code in the policy evaluation part of PPO as follows, but a run that loads the saved policy network does not reproduce the values stored in the logs variable.

            with set_exploration_mode("mean"), torch.no_grad():
                # execute a rollout with the trained policy
                torch.save(policy_module.state_dict(), "saved_policy_network")  # save the current policy network (but a run that loads it does not reproduce the logs saved below)
                eval_rollout = env.rollout(1000, policy_module)
                env.transform.dump()
                logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
                logs["eval reward (sum)"].append(
                    eval_rollout["next", "reward"].sum().item()
                )
                logs["eval step_count"].append(eval_rollout["step_count"].max().item())
                eval_str = (
                    f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
                    f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
                    f"eval step-count: {logs['eval step_count'][-1]}"
                )
                del eval_rollout
                np.save("saved_logs", np.array(dict(logs))) # save current logs

I set the seed at the beginning of the program, but that alone does not help. It may be that the random state is no longer consistent between training and running the learned model. In any case, I am still unable to make the PPO sample code reproducible.

You need to also save the value_module.
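
For example, a minimal sketch (the file names here are just placeholders):

    # Save the parameters of both modules
    torch.save(policy_module.state_dict(), "./models/policy.model")
    torch.save(value_module.state_dict(), "./models/value.model")

    # ... and restore them before running the evaluation rollout
    policy_module.load_state_dict(torch.load("./models/policy.model"))
    value_module.load_state_dict(torch.load("./models/value.model"))
    policy_module.eval()
    value_module.eval()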

Thank you for your reply.
It looks like value_module is not used at test time. Can I simply save the value_module at test time, together with the policy_module?
Or do I need to save the value_module at every epoch?
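
For example, would saving something like this at the end of each epoch be what you mean? (Just a sketch; epoch and the paths are placeholders.)

    # Hypothetical per-epoch checkpoint (epoch variable and paths are placeholders)
    torch.save(policy_module.state_dict(), f"./models/policy_epoch_{epoch}.model")
    torch.save(value_module.state_dict(), f"./models/value_epoch_{epoch}.model")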

After applying all of the documented seed fixes, setting the seed on the SyncDataCollector, and saving both the policy_module and the value_module, I obtained almost the same results between the original run and the reloaded models.

However, as shown in the logs below, only the reward value differs between runs. Do I need to fix another seed, or am I missing something?

The logs dictionary from the original run:

{'orig_reward': [9.090715408325195], 
'reward': [10.649359703063965],  # this value is not matched
'step_count': [14], 
'lr': [0.0003], 
'eval reward': [9.1824312210083], 
'eval reward (sum)': [73.4594497680664], 
'eval step_count': [7]}

The logs dictionary when using the imported saved model (policy and value):

{'orig_reward': [9.090715408325195], 
'reward': [10.670992851257324], # this value is not matched 
'step_count': [14], 
'lr': [0.0003], 
'eval reward': [9.1824312210083], 
'eval reward (sum)': [73.4594497680664], 
'eval step_count': [7]}

Seed-fixing part of the code:

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)
    # ....
    _ = env.set_seed(seed)
    # ....
    collector = SyncDataCollector(
        env,
        policy_module,
        frames_per_batch=frames_per_batch,
        total_frames=total_frames,
        split_trajs=False,
        device=device,
    )
    collector.set_seed(seed)

Evaluation part of the code (testing the learned policy):

            with set_exploration_mode("mean"), torch.no_grad():
                # Use this part when loading and using a learned policy and learned value module.
                # policy_module.load_state_dict(torch.load("./models/policy.model"))
                # policy_module.eval()
                # value_module.load_state_dict(torch.load("./models/value.model"))
                # value_module.eval()

                # execute a rollout with the trained policy

                eval_rollout = env.rollout(1000, policy_module)
                env.transform.dump()
                logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
                logs["eval reward (sum)"].append(
                    eval_rollout["next", "reward"].sum().item()
                )
                logs["eval step_count"].append(eval_rollout["step_count"].max().item())
                eval_str = (
                    f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
                    f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
                    f"eval step-count: {logs['eval step_count'][-1]}"
                )
                del eval_rollout

                # Use this part when saving the policy and value modules.
                # torch.save(policy_module.state_dict(), "./models/policy.model")
                # torch.save(value_module.state_dict(), "./models/value.model")

                np.save("./saved-logs.npy", np.array(dict(logs)))