Continuous action A3C


Has anyone got A3C working with continuous actions? I thought it would be a good idea to ask before trying it myself, as there’s probably a good reason why no one seems to have got it working yet.

I’m working on modifying @Ilya_Kostrikov’s implementation.

For example, OpenAI Gym’s Pendulum has a state/observation vector of only 3 elements, so there’s no need for any conv layers in the Actor-Critic module. Basically, this is what I’m trying:

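import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

lstm_out = 256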
enc_in = 3 # for pendulum
enc_hidden = 200
enc_out = lstm_out

class ActorCritic(nn.Module):

    def __init__(self, lstm_in):
        super(ActorCritic, self).__init__()
        self.fc_enc_in = nn.Linear(enc_in, enc_hidden) # enc_input_layer
        self.fc_enc_out = nn.Linear(enc_hidden, enc_out) # enc_output_layer
        self.lstm = nn.LSTMCell(lstm_in, lstm_out)
        self.actor_mu = nn.Linear(lstm_out, 1)
        self.actor_sigma = nn.Linear(lstm_out, 1)
        self.critic_linear = nn.Linear(lstm_out, 1)

    def forward(self, inputs):
        x, (hx, cx) = inputs

        x = F.relu(self.fc_enc_in(x))
        x = self.fc_enc_out(x)

        hx, cx = self.lstm(x, (hx, cx))
        x = hx

        return self.critic_linear(x), self.actor_mu(x), self.actor_sigma(x), (hx, cx)

The initialisation code then looks like:

env = gym.envs.make("Pendulum-v0")
lstm_in = enc_out # the LSTM consumes the encoder output (256), not the raw 3-element observation
global_model = ActorCritic(lstm_in)
local_model = ActorCritic(lstm_in)

The training code is where I get confused (as usual):

env = gym.envs.make("Pendulum-v0")
s0 = env.reset()
done = True
state = torch.from_numpy(s0).float().unsqueeze(0) 
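hx = Variable(torch.zeros(1, lstm_out)) # zeroed LSTM hidden state at the start of an episode
cx = Variable(torch.zeros(1, lstm_out)) # zeroed LSTM cell state
value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))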

#mu = mu.clamp(-1, 1) # constrain to sensible values
sigma = nn.Softplus()(sigma) + 1e-5 # constrain sigma (the standard deviation) to be positive
normal_dist = torch.normal(mu, sigma) # a sample from N(mu, sigma^2), not a distribution object

entropy = 0.5 * (torch.log(2. * np.pi * sigma.pow(2)) + 1.)

# TODO Calculate the Gaussian log-likelihood, log(1/sqrt(2*pi*sigma^2)) - (x - mu)^2/(2*sigma^2)
# See -
log_prob = -0.5 * torch.log(2. * np.pi * sigma.pow(2)) - (normal_dist - mu).pow(2) / (2. * sigma.pow(2))

action = normal_dist.data # the sampled action, detached from the graph

state, reward, done, _ = env.step(action.numpy()[0])
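
For reference, here is a minimal sketch of the Gaussian log-likelihood and entropy that the TODO above is after. It assumes sigma is the standard deviation (which is what torch.normal expects) and uses plain Python floats rather than tensors. The function names are made up for illustration.

```python
import math

def gaussian_log_prob(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x; sigma is the standard deviation."""
    var = sigma ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2); it does not depend on mu."""
    return 0.5 * (math.log(2.0 * math.pi * sigma ** 2) + 1.0)

# Log-density at the mean of a unit Gaussian, and that Gaussian's entropy
print(gaussian_log_prob(0.0, 0.0, 1.0))
print(gaussian_entropy(1.0))
```

Note that both expressions contain sigma squared inside the logarithm. If the network head outputs the variance instead of the standard deviation, the squares would be dropped.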


Reference - DeepMind’s A3C paper,
Section 9 - Continuous Action Control Using the MuJoCo Physics Simulator

How do you do a logarithm, in PyTorch?

>>> nnlog = nn.Log()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.nn' has no attribute 'Log'

Many elementwise operations, especially traditional mathematical ones, are in the torch namespace (so torch.log(var) or var.log() here). If they don’t give an error when applied to a Variable, they’re differentiable. In general you don’t need to instantiate modules like Softplus; the versions in nn are provided to make it easier to use nn.Sequential and all parameterless modules in nn have a simpler functional equivalent in F.
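
A small sketch of the point above, assuming a recent PyTorch in which Variable has been merged into Tensor (so requires_grad=True on a tensor plays the role of wrapping it in a Variable). The variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.5, 1.0, 2.0], requires_grad=True)

log_a = torch.log(x)                # function in the torch namespace
log_b = x.log()                     # equivalent method form
sp = F.softplus(x)                  # functional form, nothing to instantiate
sp_module = torch.nn.Softplus()(x)  # module form, mainly useful inside nn.Sequential

# Both operations are differentiable: d/dx [log(x) + softplus(x)] = 1/x + sigmoid(x)
loss = (log_a + sp).sum()
loss.backward()
print(x.grad)
```

The module and functional forms compute the same values; the functional one just skips the instantiation step.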

@jekbradbury thank you very much !!!

Well, I came back to this after a few days, and I’m still stuck. Any advice would make you a genius in my view!

Here’s my code, as simple as I could make it, in one big blob,

I keep getting this error,

File "", line 174, in <module>
value_loss = value_loss + advantage.pow(2)
AttributeError: 'numpy.ndarray' object has no attribute 'pow'

I don’t understand why advantage has become a numpy.ndarray instead of a torch tensor. This never occurred with the discrete-action implementation.

Any ideas what I’ve got wrong?

Thanks a lot for your help,



reward is probably returned from gym as a numpy object (I guess a scalar?) so I think you have to convert it?
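
A hedged sketch of one way to do that conversion, assuming a recent PyTorch and NumPy. The rewards and values lists below are stand-ins for what a rollout might collect, not the code from the post. The idea is to turn each reward into a plain Python float before it meets a tensor, so the arithmetic stays on the torch side.

```python
import numpy as np
import torch

gamma = 0.99

# Stand-ins: gym may hand back rewards as NumPy scalars or one-element arrays
rewards = [np.float64(-1.0), np.array([-0.5]), np.float64(-0.25)]
values = [torch.tensor([[0.1]], requires_grad=True) for _ in rewards]

R = torch.zeros(1, 1)
value_loss = torch.zeros(1, 1)
for t in reversed(range(len(rewards))):
    r = np.asarray(rewards[t]).item()  # NumPy scalar or array -> Python float
    R = gamma * R + r                  # discounted return, still a torch tensor
    advantage = R - values[t]          # tensor, so .pow() is available
    value_loss = value_loss + advantage.pow(2)

print(type(advantage), value_loss.item())
```

Converting at the point where the reward is appended to the list (right after env.step) would work just as well and keeps the loss loop untouched.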

Hi, @jekbradbury thanks a lot!

I tried converting it to a torch tensor but couldn’t get it to work. I’ll try again, though.

What seems to help a little is changing the code to

    for t in reversed(range(len(rewards))):
        R = torch.mul(R, args.gamma)  
        R = torch.add(R, rewards[t])
        advantage = R - values[t]
        value_loss = value_loss + advantage.pow(2)

Now I get the error,

  File "", line 185, in <module>
    (policy_loss + 0.5 * value_loss).backward()
  File "/home/ajay/anaconda3/envs/pyphi/lib/python3.6/site-packages/torch/autograd/", line 158, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/home/ajay/anaconda3/envs/pyphi/lib/python3.6/site-packages/torch/autograd/", line 13, in _do_backward
    raise RuntimeError("differentiating stochastic functions requires "
RuntimeError: differentiating stochastic functions requires providing a reward

Which is perhaps a little bit better than before? I think gym environments are a bit strange?

You forgot to call .reinforce on some of the stochastic outputs.
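
For readers on newer PyTorch versions: as far as I know, stochastic functions and .reinforce were later dropped in favour of torch.distributions, where the score-function gradient is written out explicitly as log_prob times the advantage. A minimal sketch of that style follows; the parameter values and the advantage are invented for illustration and it is not the code from this thread.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)

# Stand-ins for the actor heads' outputs
mu = torch.tensor([0.2], requires_grad=True)
raw_sigma = torch.tensor([0.0], requires_grad=True)
sigma = F.softplus(raw_sigma) + 1e-5   # keep the standard deviation positive

dist = Normal(mu, sigma)
action = dist.sample()                 # sampling itself carries no gradient
advantage = torch.tensor([1.5])        # hypothetical advantage estimate

# Score-function (REINFORCE) loss: gradients flow through log_prob only
policy_loss = -(dist.log_prob(action) * advantage).sum()
policy_loss.backward()
print(mu.grad, raw_sigma.grad)
```

With this formulation there is no hidden reward to provide, so the "requires providing a reward" error cannot arise.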

@apaszke Fan Q

Hi @AjayTalati, it’s really nice to see someone else working on this. I’ve also implemented continuous A3C and got some results on MuJoCo envs; you can check it out:


Hi @andrewliao11,

great stuff !!! That’s really cool, nice videos :smile:

I never managed to get it working very well (I tried it on non-MuJoCo environments), so I went back to experimenting with the discrete-action version. Do you plan on experimenting with shared RMSProp?

A3C is a great tool - you can apply it to a lot of stuff - it should be really helpful to you in the future!

Kind regards,


I’ll try shared RMSProp in the near future!
However, I think continuous A3C is a little unstable (you can refer to the learning curve here).
The problem might be an insufficient number of threads, which would cause the asynchronous updates to fail at reducing the correlation between samples.

Thanks for sharing your code!
It seems that you keep exploring at line 79:

action = (mu + sigma_sq.sqrt()*Variable(eps)).data

But shouldn’t you just exploit with action = mu? That might explain the instability shown in your learning curves.
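
To make the distinction concrete, here is a rough sketch with a hypothetical select_action helper in plain Python. It takes the standard deviation sigma directly, whereas the quoted line takes the square root of a variance. Sampling is used while collecting training data and the mean is used for evaluation.

```python
import random

def select_action(mu, sigma, explore, rng=random):
    """Return a sampled action when exploring, or the mean action when exploiting."""
    if explore:
        eps = rng.gauss(0.0, 1.0)  # standard normal noise
        return mu + sigma * eps    # same form as the quoted line
    return mu                      # deterministic: act on the mean

rng = random.Random(0)
train_action = select_action(0.3, 0.5, explore=True, rng=rng)
eval_action = select_action(0.3, 0.5, explore=False)
print(train_action, eval_action)
```

Whether the reported curves came from sampled or mean actions is not stated in the thread, so this only illustrates the suggestion.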