Implementation of Hierarchical Actor-Critic with Policy-on Policy-off Policy Optimization (P3O) for primitive actions

Hello everyone,

I’m currently working on a Hierarchical Reinforcement Learning (HRL) project and am attempting to implement a hybrid architecture based on the Hierarchical Actor-Critic (HAC) framework.

I started from the original HAC algorithm, which uses DDPG and HER at all three layers, and tried to replace the lowest level, which controls the primitive actions (e.g., joint torques in a robotic manipulator), with the Policy-on Policy-off Policy Optimization (P3O) algorithm. (I initially tried PPO, but P3O seemed to fit the off-policy architecture of the higher layers better.) A rough sketch of the resulting layer stack is below.
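
To make the setup concrete, this is how I currently think of the three layers. It's only an illustrative sketch; the class and field names are placeholders of mine, not objects from the HAC repo:

```python
# Rough sketch of the layer stack; all names here are my own placeholders,
# not classes from the HAC repository.
from dataclasses import dataclass

@dataclass
class LayerSpec:
    algorithm: str     # learning algorithm used at this layer
    output: str        # what an "action" of this layer is
    experience: str    # how experience is stored / relabeled

HIERARCHY = [
    LayerSpec("DDPG + HER", "subgoal state for layer 1",
              "off-policy replay with hindsight goal relabeling"),
    LayerSpec("DDPG + HER", "subgoal state for layer 0",
              "off-policy replay with hindsight goal relabeling"),
    LayerSpec("P3O", "primitive action (joint torques)",
              "on-policy rollouts plus an off-policy replay buffer"),
]
```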

I am using the environment found in the original HAC implementation: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-

I have managed to integrate the P3O agent into the hierarchical loop. The overall system runs, policies are being sampled, and updates are being performed, but the agent is not learning any effective behavior. The high-level policies seem to struggle to set meaningful subgoals, and the low-level P3O agent never converges to a policy that reliably reaches the assigned subgoals.
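
For reference, this is roughly how my bottom layer executes a subgoal and collects transitions for the P3O update. It's a simplified sketch, not code from the repo: `policy` and `env_step` stand in for my P3O actor and the environment step, and the sparse -1/0 reward mirrors the scheme HAC uses at every layer.

```python
import numpy as np

def subgoal_reached(state, subgoal, threshold=0.05):
    """HAC-style sparse test: is the achieved part of the state within
    `threshold` of the subgoal on every dimension? Assumes the subgoal
    dimensions are the leading entries of the state vector."""
    achieved = np.asarray(state)[: len(subgoal)]
    return bool(np.all(np.abs(achieved - np.asarray(subgoal)) <= threshold))

def run_low_level(policy, env_step, state, subgoal, horizon=40):
    """Execute up to `horizon` primitive actions toward `subgoal`,
    collecting goal-conditioned transitions for the P3O update.
    `policy` maps an observation to joint torques; `env_step` applies
    them and returns the next state."""
    trajectory = []
    for _ in range(horizon):
        obs = np.concatenate([state, subgoal])   # condition the actor on the subgoal
        action = policy(obs)
        next_state = env_step(action)
        done = subgoal_reached(next_state, subgoal)
        reward = 0.0 if done else -1.0           # sparse -1/0 reward, as in HAC
        trajectory.append((obs, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return state, trajectory
```

The higher layers then treat the final `state` as the outcome of their subgoal "action", as in the original HAC loop.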

I would like to know if anybody has heard of something similar before, and whether this architecture seems feasible and potentially useful in some cases.

Thank you in advance for any guidance.