Trust region update

@AjayTalati ACER’s TRPO is different from the TRPO implementations you posted. ACER’s trust region update uses a first-order constraint/derivative taken with respect to the running average of past parameter configurations; being first order makes it less computationally expensive and more tractable for larger spaces.
The @mjacar and @Ilya_Kostrikov TRPO implementations you posted are the old (original Schulman 2015) version, which uses a second-order constraint/derivative with respect to the current parameter configuration. The old version is more computationally expensive and can be intractable for larger spaces.
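
Roughly, the first-order update looks like this - a quick sketch of the closed-form projection described in the ACER paper, not the gist’s actual code; `delta` (the KL bound) and the assumption that the gradients are flattened into 1-D tensors are mine:

```python
import torch

def trust_region_grad(g, k, delta=1.0):
    """Project loss gradient g so the linearised KL to the average policy
    stays within delta (first-order trust region, as in the ACER paper).

    g: gradient of the objective w.r.t. the policy statistics (1-D)
    k: gradient of KL(average_policy || current_policy) w.r.t. the same
       statistics (1-D)
    """
    # Closed-form solution of the linearised KL-constrained problem:
    # z* = g - max(0, (k.g - delta) / ||k||^2) * k
    scale = torch.clamp((torch.dot(k, g) - delta) / (torch.dot(k, k) + 1e-10),
                        min=0.0)
    return g - scale * k
```

The adjusted gradient is then backpropagated into the network in place of `g`, so no second-order machinery (Fisher-vector products, conjugate gradient) is needed.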
Also, the Malmo version is missing some changes, so refer to the gist version for now:


Hi Ethan @ethancaballero,

thanks a lot, that’s really helpful - seems pretty understandable 🙂

So this version (in the gist above) uses only a first-order constraint/derivative - I guess it’s what John Schulman calls “Proximal Policy Optimization” in his NIPS 2016 tutorial, around 42:50.
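
If I’ve understood the first-order idea, it’s something like maximising the usual surrogate objective with a KL penalty to the old policy instead of a hard second-order constraint - here’s my rough sketch (the names, and `beta` as a penalty coefficient, are my own guesses, not anything from the gist):

```python
import torch

def kl_penalised_loss(log_probs, old_log_probs, advantages, kl, beta=1.0):
    # Importance-sampling ratio between the current and old policies
    ratio = torch.exp(log_probs - old_log_probs)
    # Maximise the surrogate advantage while penalising divergence from
    # the old policy; negated so it can be minimised by gradient descent
    return -(ratio * advantages).mean() + beta * kl.mean()
```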

I’m quite new to ACER and TRPO - to be honest, I’m still working on understanding them both. Do you have a standalone/simple PyTorch implementation of ACER with TRPO, for, say, CartPole?

The gist says:

Computes a trust region loss based on an existing loss and two distributions
model/distribution/loss is from the most recent params; ref_model/ref_distribution is from the average model’s params

It should all make sense once I see the model, ref_model, distribution, ref_distribution, loss, and how you calculate kl_div for a simple example. Or maybe I could work on modifying @mjacar’s TRPO implementation with him?
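
Here’s how I currently picture those pieces fitting together on something CartPole-sized - entirely my own toy sketch, not the gist’s code (`alpha`, the averaging decay, is a number I made up):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# model holds the most recent params; ref_model is the running average
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
ref_model = copy.deepcopy(model)

state = torch.randn(1, 4)  # fake CartPole observation
distribution = F.softmax(model(state), dim=1)
with torch.no_grad():
    ref_distribution = F.softmax(ref_model(state), dim=1)

# KL(ref_distribution || distribution), per the docstring quoted above
kl_div = (ref_distribution * (ref_distribution.log() - distribution.log())).sum()

# After each optimiser step, pull the average model towards the new params
alpha = 0.99
for avg_p, p in zip(ref_model.parameters(), model.parameters()):
    avg_p.data.mul_(alpha).add_(p.data, alpha=1 - alpha)
```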

All the best,

Ajay


I’ve just released my ACER repo. As I mentioned, it’s definitely not 100% working, and I don’t have time to work on it immediately, so I’m happy to have others help out 🙂
