I am trying to implement DeepMind's Distributed Proximal Policy Optimization (https://arxiv.org/abs/1707.02286).
But I am not really confident with multiprocessing, and I don't see how to implement the synchronous updates. The idea is to have a chief process that collects the gradients sent by the training workers. When the chief has received enough gradients (at least N of them), it sums them and performs the optimizer update. The workers that sent gradients have to wait for the chief's update before continuing their runs. (The paper explains this much more clearly in the supplemental section.)
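Here is a minimal sketch of how I imagine the chief/worker synchronisation could look with `torch.multiprocessing`, in case it helps frame the question. It simplifies the paper's scheme by waiting for gradients from *all* workers (the paper allows the chief to proceed after W - D of them), and the gradient computation itself is a random stand-in, not a real PPO loss. The names `worker`, `chief`, `N_WORKERS`, and `UPDATES` are my own, not from the paper.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

N_WORKERS = 4   # W in the paper; here the chief waits for all W gradients
UPDATES = 100   # number of synchronous optimizer steps


def worker(model, grad_queue, barrier):
    for _ in range(UPDATES):
        # Stand-in for a real PPO step: run the policy, compute the clipped
        # surrogate loss, call loss.backward(), then read each p.grad.
        grads = [torch.randn_like(p) for p in model.parameters()]
        grad_queue.put(grads)
        # Block here until the chief has applied the update, so the next
        # rollout uses the fresh shared parameters.
        barrier.wait()


def chief(model, optimizer, grad_queue, barrier):
    for _ in range(UPDATES):
        # Collect one gradient list per worker before updating.
        batch = [grad_queue.get() for _ in range(N_WORKERS)]
        optimizer.zero_grad()
        for param, *worker_grads in zip(model.parameters(), *batch):
            # Sum the workers' gradients for this parameter.
            param.grad = torch.stack(worker_grads).sum(dim=0)
        optimizer.step()
        barrier.wait()  # release the waiting workers


if __name__ == "__main__":
    model = nn.Linear(3, 1)   # toy policy network
    model.share_memory()      # workers see the chief's updates in place
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    grad_queue = mp.Queue()
    barrier = mp.Barrier(N_WORKERS + 1)  # all workers + the chief

    procs = [mp.Process(target=worker, args=(model, grad_queue, barrier))
             for _ in range(N_WORKERS)]
    for p in procs:
        p.start()
    chief(model, optimizer, grad_queue, barrier)
    for p in procs:
        p.join()
```

The `Barrier` is what makes the update synchronous: every worker blocks on it after sending its gradients, and the chief only reaches it after stepping the optimizer, so no worker can start a new rollout against stale parameters. I am not sure this is the cleanest way to do it, which is exactly what I'd like feedback on.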
Also, unfortunately, not all of the hyperparameters are provided in the paper, so I have to tune them myself.
My code is here:
So far, I am trying to solve a simple pendulum environment, and it's not even converging.
Don't hesitate to contribute; I need help!