Continuous action A3C

Hi,

I wonder if anyone has got A3C working with continuous actions? I guessed it would be a good idea to ask before trying it myself, in case there’s a good reason why no one has got it working yet.

I’m working on modifying @Ilya_Kostrikov’s implementation.

So, for example, OpenAI’s Pendulum only has a state/observation vector of length 3, so there’s no need for any convolutions in the Actor-Critic module. Basically I’m trying,

import torch
import torch.nn as nn
import torch.nn.functional as F

lstm_out = 256
enc_in = 3 # observation size for Pendulum
enc_hidden = 200
enc_out = lstm_out

class ActorCritic(nn.Module):

    def __init__(self , lstm_in ):
        super(ActorCritic, self).__init__( )  
        self.fc_enc_in  = nn.Linear(enc_in,enc_hidden) # enc_input_layer
        self.fc_enc_out  = nn.Linear(enc_hidden,enc_out) # enc_output_layer             
        self.lstm = nn.LSTMCell(lstm_in, lstm_out)
        self.actor_mu = nn.Linear(lstm_out, 1)
        self.actor_sigma = nn.Linear(lstm_out, 1)
        self.critic_linear = nn.Linear(lstm_out, 1)
        self.train()

    def forward(self, inputs):
        
        x, (hx, cx) = inputs

        x = F.relu(self.fc_enc_in(x))
        x = self.fc_enc_out(x)

        hx, cx = self.lstm(x, (hx, cx))
        x = hx

        return self.critic_linear(x), self.actor_mu(x), self.actor_sigma(x), (hx, cx)

The initialisation code in main.py then looks like,

env = gym.envs.make("Pendulum-v0")
lstm_in = 3    
global_model = ActorCritic( lstm_in )
global_model.share_memory()
local_model = ActorCritic( lstm_in )

And the training code is where I get confused (as usual),

env = gym.envs.make("Pendulum-v0")
s0 = env.reset()
done = True

# initialise the LSTM hidden state to zeros (size lstm_out, as above)
cx = Variable(torch.zeros(1, lstm_out))
hx = Variable(torch.zeros(1, lstm_out))

state = torch.from_numpy(s0).float().unsqueeze(0)
value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))

#mu = mu.clamp(-1, 1) # constrain to sensible values 
Softplus=nn.Softplus()     
sigma = Softplus(sigma + 1e-5) # constrain to sensible values
normal_dist = torch.normal(mu, sigma) 

prob = normal_dist
log_prob = torch.log(prob)
entropy = 0.5 * (torch.log(2. * np.pi * sigma ) + 1.)

##--------------------------------------------------------------
# Gaussian log-likelihood (treating sigma as the variance sigma^2):
#   log N(x | mu, sigma) = log(1/sqrt(2*pi*sigma)) - (x - mu)^2 / (2*sigma)
# See - https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
#
log_prob = -0.5 * torch.log(2. * np.pi * sigma) - (normal_dist - mu).pow(2) / (2. * sigma)
##--------------------------------------------------------------

action = Variable( prob.data )

#action=[0,]
state, reward, done, _ = env.step([action.data[0][0]])
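
For reference, the per-step loss terms I think I need, following the paper (just a sketch - the 0.01 entropy weight is my own guess, R is the bootstrapped return as in the discrete-action code, and Variable(advantage.data) is there to stop the policy gradient flowing back into the critic),

# accumulated over the rollout, as in the discrete-action version
advantage = R - value
value_loss = value_loss + advantage.pow(2)
policy_loss = policy_loss - log_prob * Variable(advantage.data) - 0.01 * entropy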

References,

DeepMind’s A3C paper: https://arxiv.org/pdf/1602.01783.pdf
Section 9 - Continuous Action Control Using the MuJoCo Physics Simulator

Here’s a diagram of the algorithm, from the deeplearninc/relaax repo on GitHub (a Reinforcement Learning framework to facilitate development and use of scalable RL algorithms and applications).


How do you do a logarithm in PyTorch?

>>> nnlog = nn.Log()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.nn' has no attribute 'Log'

Many elementwise operations, especially traditional mathematical ones, are in the torch namespace (so torch.log(var) or var.log() here). If they don’t give an error when applied to a Variable, they’re differentiable. In general you don’t need to instantiate modules like Softplus; the versions in nn are provided to make it easier to use nn.Sequential, and all parameterless modules in nn have a simpler functional equivalent in F.
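
For example (a minimal sketch using the Variable API):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

sigma = Variable(torch.rand(1, 1), requires_grad=True)

# elementwise maths lives in the torch namespace (or as tensor methods)
log_sigma = torch.log(sigma)          # same as sigma.log()

# parameterless nn modules have functional equivalents in F
sigma_pos = F.softplus(sigma) + 1e-5  # no need to instantiate nn.Softplus()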


@jekbradbury thank you very much !!!

Well, I came back to this after a few days, and I’m still stuck, so any advice will make you a genius in my view!

Here’s my code, as simple as I could make it, in one big blob,

I keep getting this error,

File "main_single.py", line 174, in <module>
value_loss = value_loss + advantage.pow(2)
AttributeError: 'numpy.ndarray' object has no attribute 'pow'

I don’t understand why advantage has become a numpy.ndarray instead of a torch tensor - this never happened with the discrete-action implementation.

Any ideas what I’ve got wrong?

Thanks a lot for your help,

Best,

Ajay

reward is probably returned from gym as a numpy object (I guess a scalar?) so I think you have to convert it?
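
For example (just a sketch, reusing the names from your snippet, and assuming the reward really does come back from env.step as a numpy scalar):

# cast the gym reward to a plain Python float when storing the rollout,
# so the later torch arithmetic isn't silently promoted to numpy
state, reward, done, _ = env.step([action.data[0][0]])
rewards.append(float(reward))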


Hi, @jekbradbury thanks a lot!

I tried converting to a torch tensor but couldn’t get it to work - I’ll try again though.

What seems to help a little is changing the code to

    for t in reversed(range(len(rewards))):
        R = torch.mul(R, args.gamma)  
        R = torch.add(R, rewards[t])
        advantage = R - values[t]
        value_loss = value_loss + advantage.pow(2)

Now I get the error,

  File "main_single.py", line 185, in <module>
    (policy_loss + 0.5 * value_loss).backward()
  File "/home/ajay/anaconda3/envs/pyphi/lib/python3.6/site-packages/torch/autograd/variable.py", line 158, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/home/ajay/anaconda3/envs/pyphi/lib/python3.6/site-packages/torch/autograd/stochastic_function.py", line 13, in _do_backward
    raise RuntimeError("differentiating stochastic functions requires "
RuntimeError: differentiating stochastic functions requires providing a reward

Which is perhaps a little bit better than before? I think gym environments are a bit strange?

You forgot to call .reinforce on some of the stochastic outputs.
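
i.e. with the stochastic-function API, something along these lines (a minimal sketch with a placeholder reward; the alternative is to keep the sampling out of the graph and backprop through the explicit Gaussian log-likelihood instead):

import torch
from torch import autograd
from torch.autograd import Variable

mu = Variable(torch.zeros(1, 1), requires_grad=True)
sigma = Variable(torch.ones(1, 1), requires_grad=True)

# sampling with Variable parameters makes the output a stochastic node...
action = torch.normal(mu, sigma)

# ...so it has to be fed a reward before gradients can flow through it
action.reinforce(torch.ones(1, 1))
autograd.backward([action], [None])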


@apaszke Thank you!


Hi @AjayTalati, it’s really nice to see someone else working on this. I’ve also implemented continuous A3C and got some results on MuJoCo envs; you can check it out here: https://github.com/andrewliao11/pytorch-a3c-mujoco


Hi @andrewliao11,

great stuff !!! That’s really cool, nice videos :smile:

I never managed to get it working very well (I tried it on non-MuJoCo stuff), so I went back to experimenting with the discrete-actions version. Do you plan on experimenting with shared RMSProp?

A3C is a great tool - you can apply it to a lot of stuff - it should be really helpful to you in the future!

Kind regards,

Ajay

I’ll try shared RMSProp in the near future!
However, I think continuous A3C is a little unstable (you can refer to the learning curve here).
The problem might be insufficient threads, which makes the async updates fail (they can’t reduce the correlation between the data).
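
What I have in mind for shared RMSProp is roughly the usual trick of moving the optimizer statistics into shared memory, so every worker process updates the same running averages - a sketch only, not tested:

import torch
import torch.optim as optim

class SharedRMSprop(optim.RMSprop):
    """RMSprop whose statistics live in shared memory (sketch)."""

    def __init__(self, params, lr=7e-4, alpha=0.99, eps=1e-5):
        super(SharedRMSprop, self).__init__(params, lr=lr, alpha=alpha, eps=eps)
        # create the state up front so it can be shared before forking workers
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = torch.zeros(1)
                state['square_avg'] = p.data.new().resize_as_(p.data).zero_()

    def share_memory(self):
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'].share_memory_()
                state['square_avg'].share_memory_()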

Thanks for sharing your code!
It seems that you keep exploring in your test.py, at line 79:

action = (mu + sigma_sq.sqrt()*Variable(eps)).data

But shouldn’t you just exploit with action = mu? That might explain the instability shown in your learning curves.
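
i.e. something like this (just a sketch, reusing the names from the snippet above):

# during training: explore by sampling around the policy mean
action = (mu + sigma_sq.sqrt() * Variable(eps)).data

# during evaluation in test.py: exploit the learned policy deterministically
action = mu.data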
