Hi,
I’m implementing MAML in the reinforcement learning domain, and I’m planning to use PPO for validation. However, I’d like to know whether the PPO loss function is appropriate during meta-training: its emphasis on gradual, clipped policy updates seems like it might not be suitable for fast adaptation to new tasks in the inner loop.
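For context, here is roughly how I understand the two candidate inner-loop losses (a minimal PyTorch sketch; the tensor names are just placeholders for quantities computed from the support-set rollouts, and the vanilla version is what I believe the original MAML paper used for its inner updates, with TRPO as the meta-optimizer):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: the clip deliberately limits how far the policy moves."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def vanilla_pg_loss(log_probs, advantages):
    """Plain REINFORCE-style policy gradient loss (no ratio, no clipping)."""
    return -(log_probs * advantages).mean()
```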
I’ve seen some implementations use just the policy loss in the inner loop. However, I don’t understand how you can compute an accurate policy loss with an untrained value network, since the advantage estimates rely on the value predictions.
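To illustrate what I mean, the only alternative I can think of is estimating advantages directly from the empirical returns, something like the sketch below (purely illustrative; I believe some MAML-RL implementations fit a simple linear feature baseline per task rather than learning a value network, and here I just use the mean return as the baseline):

```python
import torch

def mc_advantages(rewards, gamma=0.99):
    """Discounted returns-to-go minus a mean baseline, normalized (no value net)."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # accumulate discounted return backwards
        running = r + gamma * running
        returns.append(running)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    adv = returns - returns.mean()       # crude baseline in place of V(s)
    return adv / (adv.std() + 1e-8)      # normalize to reduce variance
```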
Can someone help me to understand this?
Thanks