Synchronous updates for DPPO

I am trying to implement DeepMind’s Distributed Proximal Policy Optimization (https://arxiv.org/abs/1707.02286)

But I am not really confident with multiprocessing, and I don’t see how to realise the synchronous updates. The idea is to have a chief process that collects the gradients sent by the training workers. When the chief has received enough gradients (more than N), it sums them and performs the optimizer update. The workers that sent their gradients have to wait for the chief’s update before continuing their runs. (The paper explains it much more nicely in the supplemental section.)
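For what it’s worth, here is a minimal sketch of how I picture that handshake with torch.multiprocessing: a queue for the gradients and a barrier so the workers block until the chief has stepped. The names (grad_queue, update_barrier, make_loss, n_workers) are placeholders of mine, not from the paper, and this is untested:

    import torch
    import torch.multiprocessing as mp

    def chief(grad_queue, update_barrier, shared_model, optimizer, n_workers):
        # Wait for one gradient list per worker, sum them, update, then release everyone.
        while True:
            grad_lists = [grad_queue.get() for _ in range(n_workers)]  # blocks until all workers report
            for p, *worker_grads in zip(shared_model.parameters(), *grad_lists):
                p.grad = torch.stack(worker_grads).sum(dim=0)          # summed gradient for this parameter
            optimizer.step()
            optimizer.zero_grad()
            update_barrier.wait()  # the workers are parked on the same barrier; this releases them

    def worker(grad_queue, update_barrier, shared_model, make_loss):
        while True:
            loss = make_loss(shared_model)  # placeholder for the PPO loss on this worker's rollout
            grads = torch.autograd.grad(loss, list(shared_model.parameters()))
            grad_queue.put([g.cpu() for g in grads])
            update_barrier.wait()           # resume only once the chief has applied the update

    # setup in the main process:
    # grad_queue = mp.Queue(); update_barrier = mp.Barrier(n_workers + 1)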

Also, unfortunately, not all of the hyperparameters are provided in the paper, so I have to tune them myself.

My code is here:


So far, I am trying to solve a simple pendulum environment (and it’s not even converging).
Don’t hesitate to contribute, I need help!


ELF (rlpytorch) provides source code for synchronous updates:


Actually, nvm. ELF is still asynchronous, but it uses batching & bucketing so that it’s less asynchronous than normal A3C.

ELF’s updates look synchronous to me. In this context, synchronous means using the same clock, so the processes are actually run sequentially.

You could use that as a reference, but instead of updating the parameters from the queue one by one as they do, change it so the gradients are accumulated the way you want and then applied in one update.

You will want to use a queue to do this synchronously.

Check out this part of the code for reference: https://github.com/facebookresearch/ELF/blob/master/rlpytorch/utils.py


@dgriff Have you gotten ELF working with gym?
I’m currently trying to modify ELF source code to work with gym.

Not sure what you mean; ELF is supposed to be an alternative to gym.

Or do you mean the actor-critic example they provide? It looks to be the same model as the tensorpack example that runs on gym, and it has the same author, so that is probably the case. I would assume the same performance as well, and it is quite a good implementation. But I have no plans to change this example to run in gym, sorry.

I think I solved my synchronous problem. I added 3 functions to my shared models:

    def cum_grads(self):
        # Accumulate the current gradients into the shared '<name>_grad' buffers
        # (one such buffer per parameter is assumed to be registered on the shared
        # model beforehand, living in shared memory; not shown here).
        for name, p in self.named_parameters():
            if p.grad is not None:
                val = self.__getattr__(name + '_grad')
                val += p.grad.data  # in-place add keeps the buffer in shared memory

    def reset_grads(self):
        # Zero both the regular gradients and the accumulation buffers
        # once the chief has applied an update.
        self.zero_grad()
        for name, p in self.named_parameters():
            if p.grad is not None:
                self.__getattr__(name + '_grad').zero_()  # zero in place, do not rebind

    def synchronize(self):
        # Expose the accumulated gradients as the parameters' .grad, so a plain
        # optimizer.step() on the chief applies the summed update.
        # (requires `from torch.autograd import Variable` at the top of the file)
        for name, p in self.named_parameters():
            val = self.__getattr__(name + '_grad')
            p._grad = Variable(val)

When a worker has computed a new loss, I use cum_grads:

ensure_shared_grads(policy, shared_p)  # copy this worker's gradients onto the shared model
shared_p.cum_grads()                   # accumulate them into the shared gradient buffers
counter.increment()                    # tell the chief that one more gradient set is ready
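(ensure_shared_grads isn’t shown in this thread; in A3C-style PyTorch code it is usually something along these lines, a sketch rather than necessarily the exact version used here:)

    def ensure_shared_grads(model, shared_model):
        # Point the shared model's gradients at the worker's freshly computed ones,
        # so cum_grads() on the shared model has something to accumulate.
        for param, shared_param in zip(model.parameters(), shared_model.parameters()):
            if param.grad is not None:
                shared_param._grad = param.grad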

And the chief just loops as follows:

def chief(rank, params, traffic_light, counter, shared_p, shared_v, optimizer_p, optimizer_v):
    while True:
        time.sleep(1)
        # workers will wait after their last loss computation
        if counter.get() > params.update_treshold:
            # copy the accumulated gradients onto the parameters and apply the update
            shared_p.synchronize()
            shared_v.synchronize()
            optimizer_p.step()
            optimizer_v.step()
            # clear the counter and the gradient buffers for the next round
            counter.reset()
            shared_p.reset_grads()
            shared_v.reset_grads()
            traffic_light.switch() # workers start a new loss computation
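The traffic_light and counter objects are small shared-state helpers built on multiprocessing primitives, roughly like this (a simplified sketch; the exact code may differ):

    import torch.multiprocessing as mp

    class Counter:
        # Shared integer: workers increment it, the chief reads and resets it.
        def __init__(self):
            self.val = mp.Value('i', 0)
            self.lock = mp.Lock()

        def increment(self):
            with self.lock:
                self.val.value += 1

        def get(self):
            with self.lock:
                return self.val.value

        def reset(self):
            with self.lock:
                self.val.value = 0

    class TrafficLight:
        # Shared boolean: the chief flips it, workers wait until they see it change.
        def __init__(self):
            self.val = mp.Value('b', False)
            self.lock = mp.Lock()

        def get(self):
            with self.lock:
                return self.val.value

        def switch(self):
            with self.lock:
                self.val.value = not self.val.value

On the worker side, each worker remembers the light’s value before incrementing the counter and then sleeps until it sees the value flip, which is what makes it wait for the chief’s update.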

It looks to still be somewhat asynchronous. One quick test you can do to see if it is synchronous: you should no longer have to use a special shared optimizer; it should now work with just the regular PyTorch optimizers.
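For example (the learning rate here is just a placeholder): since only the chief ever calls step() on the summed gradients, a plain optimizer over the shared parameters should do:

    import torch.optim as optim

    # No shared-statistics optimizer needed: only the chief process calls step(),
    # after synchronize() has put the summed gradients on the shared parameters.
    optimizer_p = optim.Adam(shared_p.parameters(), lr=3e-4)  # lr is a placeholder
    optimizer_v = optim.Adam(shared_v.parameters(), lr=3e-4)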


Good point! I no longer need the shared optimizer. It still works with a plain one.


In fact, my implementation only solves the InvertedPendulum environment. I am looking for any help, but the paper does not clearly describe the pseudocode (a lot of sub-steps and hyperparameters are not detailed in either the DPPO paper or the PPO clip paper). Besides, the only implementation I found is the official TF implementation by John Schulman, which I translated almost literally into PyTorch, without success.

Unfortunately for me, I can’t even try the TF implementation, since the Python 3 version of MuJoCo seems incompatible with my machine :confused:. It would at least have given me some helpful reference, like what a “working” learning curve looks like.

If anyone has some useful documentation or a working implementation (any language/library) of any close algorithm (TRPO, A2C), please tell me.

I’m a little confused about some things I see in train.py.

What’s the idea behind updating the old model right before you call backward and step?

Also, are you asking for help converting the model to solve other environments, or is it that the model is not able to solve other environments?

What’s the idea behind updating the old model right before you call backward and step?

If I do it after the backward step or anywhere else, the old and new models will have the same weights… I could also update the old model only once; it is unclear in the paper whether this update must be done once or at every step.
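For reference, the usual reading is that the old policy is a frozen snapshot, refreshed from the current policy once per batch of collected data (before the optimization epochs) and used only in the probability ratio. A rough sketch; names like old_policy and the log_prob helper are placeholders, not necessarily what train.py uses:

    import torch

    def ppo_update(policy, old_policy, optimizer, states, actions, advantages,
                   num_epochs=10, clip_eps=0.2):
        # Refresh the frozen snapshot once per collected batch, before the PPO epochs.
        old_policy.load_state_dict(policy.state_dict())
        for _ in range(num_epochs):
            log_prob = policy.log_prob(states, actions)        # hypothetical helper
            with torch.no_grad():
                old_log_prob = old_policy.log_prob(states, actions)
            ratio = torch.exp(log_prob - old_log_prob)         # pi_theta / pi_theta_old
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
            loss = -torch.min(surr1, surr2).mean()             # clipped PPO objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()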

Also, are you asking for help converting the model to solve other environments, or is it that the model is not able to solve other environments?

According to the PPO paper, the set of hyperparameters I use in ppo.py should work in every environment with no modification at all. But only if I change both the batch size and the number of steps to 1000 does it solve InvertedPendulum (quickly) and InvertedDoublePendulum (much slower than in the paper). With batch size = 64 and number of steps = 2048 as suggested by the paper, it solves nothing.

So I am quite certain there is something wrong somewhere in the training loop, but I can’t figure out what is wrong.
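For reference, the continuous-control (MuJoCo) settings I read from the PPO paper are roughly the following (from memory, so worth double-checking against the paper itself):

    # MuJoCo settings reported in the PPO paper, recalled from memory:
    ppo_mujoco_hparams = {
        'horizon_T': 2048,       # environment steps collected per iteration
        'minibatch_size': 64,
        'num_epochs': 10,        # optimization epochs per batch
        'adam_stepsize': 3e-4,
        'discount_gamma': 0.99,
        'gae_lambda': 0.95,
        'clip_epsilon': 0.2,
    }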

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr: an implementation that seems to work properly!
Now I can go back to this project and check what was wrong with my code.

Thanks and Congrats @Ilya_Kostrikov!

Finally, I found what was wrong, thanks to Oleg Klimov (https://github.com/openai/baselines/issues/87#issuecomment-363235729)!

Hello, David. This is a really clear and concise implementation of DPPO! Just one question: how do you guarantee variation in the actions between different workers? In my game environment, every worker faces the same initial state, and I use the torch.distributions sample function to sample actions in each worker. Unfortunately, I found that the continuous actions sampled by the different workers are exactly the same! I use

np.random.seed(params.seed)
torch.manual_seed(params.seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

to guarantee the reproducibility of the algorithm, so I think it might be causing the lack of variation between workers. I wonder how you solved it.
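(I suspect giving each worker its own seed offset, e.g. the base seed plus the worker rank, inside each worker process would keep runs reproducible while making the sampled actions differ across workers. A sketch, with a hypothetical seed_worker helper:)

    import numpy as np
    import torch

    def seed_worker(rank, base_seed):
        # Same base seed for reproducibility, but a different offset per worker
        # so their action samples are no longer identical.
        np.random.seed(base_seed + rank)
        torch.manual_seed(base_seed + rank)

    # called inside each worker process before sampling any actions:
    # seed_worker(rank, params.seed)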