But I am not really confident with multiprocessing, and I don’t see how to implement the synchronous updates. The idea is to have a chief process that collects the gradients sent by the training workers. When the chief has received enough gradients (more than N), it sums them and performs the optimizer update. The workers that sent the gradients have to wait for the chief’s update before continuing their runs. (The paper explains it much more clearly in the supplemental section.)
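For what it’s worth, the collect-then-release pattern can be sketched with the standard library alone. This is only a sketch of the coordination, not the real thing: plain lists stand in for gradient tensors, and `N_WORKERS` / `UPDATE_THRESHOLD` are made-up names.

```python
import multiprocessing as mp

N_WORKERS = 4
UPDATE_THRESHOLD = 4  # assumed: number of gradients the chief waits for


def worker(rank, grad_queue, barrier):
    # each worker computes a (fake) gradient, sends it to the chief,
    # then blocks at the barrier until the chief has applied the update
    grad = [float(rank)] * 3          # stand-in for a real gradient tensor
    grad_queue.put(grad)
    barrier.wait()


def chief(grad_queue, barrier, result_queue):
    # collect one gradient per worker, sum them, "apply" the update,
    # then release all workers through the barrier
    total = [0.0, 0.0, 0.0]
    for _ in range(UPDATE_THRESHOLD):
        g = grad_queue.get()          # blocks until a worker sends a gradient
        total = [a + b for a, b in zip(total, g)]
    result_queue.put(total)           # a real chief would call optimizer.step() here
    barrier.wait()                    # last party arriving releases everyone


if __name__ == "__main__":
    grad_queue, result_queue = mp.Queue(), mp.Queue()
    barrier = mp.Barrier(N_WORKERS + 1)   # all workers + the chief
    procs = [mp.Process(target=worker, args=(r, grad_queue, barrier))
             for r in range(N_WORKERS)]
    c = mp.Process(target=chief, args=(grad_queue, barrier, result_queue))
    for p in procs + [c]:
        p.start()
    print(result_queue.get())         # summed gradients: [6.0, 6.0, 6.0]
    for p in procs + [c]:
        p.join()
```

The barrier is what makes it synchronous: no worker can start its next rollout until the chief has finished the step.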
Also, unfortunately, Google does not provide all the hyperparameters, so I have to tune them myself.
My code is here:
So far, I am trying to solve a simple pendulum environment (and it’s not even converging).
Don’t hesitate to contribute, I need help!
The ELF updates look synchronous to me. In this context, synchronous means the processes use the same clock, so they are actually run sequentially.
You could use that as a reference, but instead of updating the parameters from the queue one by one, change it to accumulate the gradients the way you want and then apply the update. You will want to use a queue to make this synchronous.
Not sure what you mean; ELF is supposed to be an alternative to gym.
Or do you mean the actor-critic example they provide? It looks like the same model as the tensorpack example that runs on gym, and it has the same author, so that is probably the case. I would assume the same performance as well, which is quite a good implementation. But I have no plans to adapt this example to run on gym, sorry.
I think I solved my synchronization problem. I added three functions to my shared models:
```python
def cum_grads(self):
    # accumulate this worker's gradients into the shared '_grad' buffers
    for name, p in self.named_parameters():
        if p.grad is not None:
            val = self.__getattr__(name + '_grad')
            val += p.grad.data
            self.__setattr__(name + '_grad', val)

def reset_grads(self):
    # clear both the local gradients and the shared buffers
    self.zero_grad()
    for name, p in self.named_parameters():
        if p.grad is not None:
            val = self.__getattr__(name + '_grad')
            val.zero_()  # zero the buffer in place instead of aliasing p.grad
            self.__setattr__(name + '_grad', val)

def synchronize(self):
    # copy the accumulated gradients into p.grad so the optimizer can step
    for name, p in self.named_parameters():
        val = self.__getattr__(name + '_grad')
        p._grad = Variable(val)
```
When a worker has computed a new loss, it calls cum_grads. The chief then applies the update:
```python
def chief(rank, params, traffic_light, counter, shared_p, shared_v, optimizer_p, optimizer_v):
    while True:
        time.sleep(1)
        # workers will wait after their last loss computation
        if counter.get() > params.update_treshold:
            shared_p.synchronize()
            shared_v.synchronize()
            optimizer_p.step()
            optimizer_v.step()
            counter.reset()
            shared_p.reset_grads()
            shared_v.reset_grads()
            traffic_light.switch()  # workers start a new loss computation
```
It still looks somewhat asynchronous to me. One quick test to check whether it is truly synchronous: you should no longer need a special shared optimizer; it should now work with the regular PyTorch optimizers.
In fact, my implementation only solves the InvertedPendulum environment. I am looking for any help, but the paper does not clearly describe the pseudocode (many sub-steps and hyperparameters are detailed in neither the DPPO paper nor the PPO paper). Besides, the only implementation I found is the official TF implementation by John Schulman, which I translated into PyTorch almost line by line, without success.
Unfortunately for me, I can’t even try the TF implementation, since the Python 3 version of mujoco seems incompatible with my machine. It could at least have given me some helpful reference, like what a “working” learning curve looks like.
If anyone has some useful documentation or a working implementation (any language/library) of any closely related algorithm (TRPO, A2C), please tell me.
What’s the point of updating the old model right before you call backward and step?
If I do it after the backward step, or anywhere else, the old and new models end up with the same weights… I could also update the old model only once; the paper is unclear on whether this update must be done once or at every step.
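For reference, the usual PPO pattern, as I understand it, is to snapshot the policy once per iteration, before collecting data, and keep that copy frozen across all the epochs of that iteration. A minimal sketch of the timing (the toy `Linear` modules just stand in for real policy networks, and the loss is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)        # current policy (toy stand-in)
old_model = nn.Linear(4, 2)    # frozen copy used for pi_old in the ratio

for iteration in range(3):
    # 1) freeze the current weights as the "old" policy, ONCE per iteration
    old_model.load_state_dict(model.state_dict())
    # 2) collect trajectories with old_model, then run several epochs of
    #    updates on `model`; old_model stays fixed for the whole iteration
    for epoch in range(4):
        x = torch.randn(8, 4)
        loss = (model(x) - old_model(x).detach()).pow(2).mean()  # placeholder loss
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= 0.01 * p.grad
                p.grad = None
```

If instead you call `load_state_dict` right before every `backward()`, the ratio pi/pi_old is always 1 and the clipping never does anything, which matches the symptom you describe.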
Also, are you asking for help converting the model to solve other environments, or saying that the model is not able to solve other environments?
According to the PPO paper, the set of hyperparameters I use in ppo.py should work in every environment with no modification at all. But only if I change both the batch size and the number of steps to 1000 does it solve InvertedPendulum (quickly) and InvertedDoublePendulum (much more slowly than in the paper). With batch size = 64 and number of steps = 2048 as suggested by the paper, it solves nothing.
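For comparison, these are the Mujoco settings reported in the PPO paper as far as I remember them; please double-check against the paper itself before relying on them:

```python
# Mujoco hyperparameters as reported in the PPO paper (from memory,
# verify against the paper before use)
ppo_mujoco = {
    "horizon": 2048,          # env steps collected per update
    "adam_stepsize": 3e-4,
    "num_epochs": 10,         # optimization epochs per batch
    "minibatch_size": 64,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_epsilon": 0.2,
}
```

Note that horizon 2048 with minibatch 64 means 32 minibatches per epoch, so a much smaller effective step than one full-batch update.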
So I am quite certain there is something wrong somewhere in the training loop, but I can’t figure out what.
Hello, David. This is a really clear and concise implementation of DPPO! Just one question: how do you guarantee variation in the actions between different workers? In my game environment, every worker faces the same initial state, and I use the torch.distributions sample function to sample actions in each worker. Unfortunately, I found that the continuous actions sampled by different workers are exactly the same! I use
to guarantee the reproducibility of the algorithm. So I think it might have some influence on the variation between workers, and I wonder how you solve it.
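In case it helps: if every worker process seeds its RNG with the same value (e.g. the same `torch.manual_seed` call in each process), their samplers produce identical streams, so identical states give identical actions. A common fix is to offset the seed by the worker’s rank, e.g. `torch.manual_seed(base_seed + rank)` at the start of each worker, which keeps runs reproducible while making the streams differ. The effect, illustrated with the stdlib RNG:

```python
import random

def sample_action(seed):
    # stand-in for a worker that seeds its RNG and then samples an action
    rng = random.Random(seed)
    return rng.gauss(0.0, 1.0)

base_seed = 42
# same seed in every worker -> identical "actions" (the reported problem)
same = [sample_action(base_seed) for _ in range(4)]
# offset the seed by the worker's rank -> distinct, but still reproducible
distinct = [sample_action(base_seed + rank) for rank in range(4)]
print(len(set(same)), len(set(distinct)))   # 1 4
```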