Trust region update

@AjayTalati ACER’s TRPO is different from the TRPO implementations you posted. ACER’s trust region update uses a first-order constraint/derivative taken with respect to the running average of past parameter configurations; being first order makes it less computationally expensive and more tractable for larger spaces.
The @mjacar and @Ilya_Kostrikov TRPO implementations you posted are the old (original Schulman 2015) version, which uses a second-order constraint/derivative with respect to the current parameter configuration. The old version is more computationally expensive and can be intractable for larger spaces.
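
Roughly, the first-order update looks like this - a quick sketch of the closed-form projection described in the ACER paper, not the gist’s actual code; `delta` (the KL bound) and the assumption that the gradients are flattened into 1-D tensors are mine:

```python
import torch

def trust_region_grad(g, k, delta=1.0):
    """Project loss gradient g so the linearised KL to the average policy
    stays within delta (first-order trust region, as in the ACER paper).

    g: gradient of the objective w.r.t. the policy statistics (1-D)
    k: gradient of KL(average_policy || current_policy) w.r.t. the same
       statistics (1-D)
    """
    # Closed-form solution of the linearised KL-constrained problem:
    # z* = g - max(0, (k.g - delta) / ||k||^2) * k
    scale = torch.clamp((torch.dot(k, g) - delta) / (torch.dot(k, k) + 1e-10),
                        min=0.0)
    return g - scale * k
```

The adjusted gradient is then backpropagated into the network in place of `g`, so no second-order machinery (Fisher-vector products, conjugate gradient) is needed.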
Also, the Malmo version is missing some changes, so refer to the gist version for now:


Hi Ethan @ethancaballero,

thanks a lot, that’s really helpful - seems pretty understandable 🙂

So this version (in the gist above) uses only a first-order constraint/derivative - I guess it’s what John Schulman calls “Proximal Policy Optimization” in his NIPS 2016 tutorial, around 42:50.
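
If I’ve understood the first-order idea, it’s something like maximising the usual surrogate objective with a KL penalty to the old policy instead of a hard second-order constraint - here’s my rough sketch (the names, and `beta` as a penalty coefficient, are my own guesses, not anything from the gist):

```python
import torch

def kl_penalised_loss(log_probs, old_log_probs, advantages, kl, beta=1.0):
    # Importance-sampling ratio between the current and old policies
    ratio = torch.exp(log_probs - old_log_probs)
    # Maximise the surrogate advantage while penalising divergence from
    # the old policy; negated so it can be minimised by gradient descent
    return -(ratio * advantages).mean() + beta * kl.mean()
```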

I’m quite new to ACER and TRPO - to be honest, I’m still working on understanding them both. Do you have a standalone/simple PyTorch implementation of ACER with TRPO, for, say, CartPole?

The gist says:

Computes a trust region loss based on an existing loss and two distributions
model/distribution/loss is from the most recent params; ref_model/ref_distribution is from the average model’s params

It should all make sense once I see the model, ref_model, distribution, ref_distribution, loss, and how you calculate kl_div for a simple example. Or maybe I could work on modifying @mjacar’s TRPO implementation with him?
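
Here’s how I currently picture those pieces fitting together on something CartPole-sized - entirely my own toy sketch, not the gist’s code (`alpha`, the averaging decay, is a number I made up):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# model holds the most recent params; ref_model is the running average
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
ref_model = copy.deepcopy(model)

state = torch.randn(1, 4)  # fake CartPole observation
distribution = F.softmax(model(state), dim=1)
with torch.no_grad():
    ref_distribution = F.softmax(ref_model(state), dim=1)

# KL(ref_distribution || distribution), per the docstring quoted above
kl_div = (ref_distribution * (ref_distribution.log() - distribution.log())).sum()

# After each optimiser step, pull the average model towards the new params
alpha = 0.99
for avg_p, p in zip(ref_model.parameters(), model.parameters()):
    avg_p.data.mul_(alpha).add_(p.data, alpha=1 - alpha)
```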

All the best,

Ajay


I’ve just released my ACER repo. As I mentioned, it’s definitely not 100% working, and I don’t have time to work on it immediately, so I’m happy to have others help out 🙂
