A3C in PyTorch: does plain A3C with multinomial/categorical action sampling work well in a continuous state space with discrete actions?


I am trying to implement A3C reinforcement learning in PyTorch.

However, the same action is taken every time, whether I sample with `multinomial` or with `Categorical`.

Below is my code. I need help understanding whether plain A3C works well in a continuous state space with discrete actions.


    while count < max_timesteps - 1:
        episode_length += 1
        if done:
            # reset the LSTM state at episode boundaries
            cx = torch.zeros(1, params.state_dim)
            hx = torch.zeros(1, params.state_dim)
        else:
            # keep the state but cut the graph between updates
            cx = cx.detach()
            hx = hx.detach()
        values = []
        log_probs = []
        rewards = []
        entropies = []
        while count < max_timesteps - 1:
            value, action_values, (hx, cx) = model((state.unsqueeze(0), (hx, cx)))
            prob = F.softmax(action_values, dim=-1)
            log_prob = F.log_softmax(action_values, dim=-1)
            entropy = -(log_prob * prob).sum(1, keepdim=True)
            cdist = categorical.Categorical(prob)
            action = cdist.sample()
            # keep the graph here: calling .data (or .detach()) on the chosen
            # log-probability cuts the policy gradient, so the policy never learns
            log_prob = log_prob.gather(1, action.unsqueeze(1))
            state, reward, done = env.step(action)
            done = (done or count == max_timesteps - 2)
            reward = max(min(reward, 1), -1)
            count += 1
            # store the rollout; without these appends the update loop
            # below has nothing to train on
            values.append(value)
            log_probs.append(log_prob)
            rewards.append(reward)
            entropies.append(entropy)
            print(ticker, " action:", action, "reward ", reward)
            if done:
                episode_length = 0
                state = env.reset()
                break  # end the rollout; the outer loop restarts it

        # bootstrap the return from the last state if the episode is not over
        R = torch.zeros(1, 1)
        if not done:
            value, _, _ = model((state.unsqueeze(0), (hx, cx)))
            R = value.detach()
        values.append(R)  # needed for values[i + 1] in the TD term below
        policy_loss = 0
        value_loss = 0
        gae = torch.zeros(1, 1)
        for i in reversed(range(len(rewards))):
            R = params.gamma * R + rewards[i]
            advantage = R - values[i]
            value_loss = value_loss + 0.5 * advantage.pow(2)
            TD = rewards[i] + params.gamma * values[i + 1].detach() - values[i].detach()
            gae = gae * params.gamma * params.tau + TD
            policy_loss = policy_loss - log_probs[i] * gae - 0.01 * entropies[i]

        optimizer.zero_grad()  # assumes an optimizer over model.parameters()
        (policy_loss + 0.5 * value_loss).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 40)
        optimizer.step()

A3C works in continuous state spaces, but it is unstable (even more so than A2C) and produces poorer results than PPO or IMPALA (distributed).
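As for the "same action every time" symptom: `Categorical.sample()` does produce varied actions as long as the softmax output is not degenerate, so if one action is always drawn, the logits have usually saturated (one probability near 1) or the gradient is being cut somewhere. A minimal sanity check, with made-up probabilities:

    import torch
    from torch.distributions import Categorical

    torch.manual_seed(0)
    # a healthy (non-degenerate) policy distribution over 4 discrete actions
    probs = torch.tensor([0.4, 0.3, 0.2, 0.1])
    dist = Categorical(probs)
    samples = [dist.sample().item() for _ in range(1000)]
    assert len(set(samples)) > 1  # several distinct actions are drawn

If this passes but your agent still always picks one action, inspect `prob` at runtime rather than the sampler.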

For a correct implementation, consider this reference:


Thanks for the answer. Could you please let me know whether there is a version of PPO with an LSTM?

If you just need PPO + LSTM with 1-step backprop, implementations exist. It is trickier if you want full BPTT (back-propagation through time).
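The "1-step backprop" variant just detaches the LSTM hidden state after each step, so gradients never flow across time steps. A minimal sketch with toy sizes (4-dim observation, 8-dim hidden state) and a stand-in loss, not a full PPO update:

    import torch
    import torch.nn as nn

    lstm = nn.LSTMCell(4, 8)  # toy sizes: 4-dim observation, 8-dim hidden state
    hx = torch.zeros(1, 8)
    cx = torch.zeros(1, 8)

    for step in range(5):
        obs = torch.randn(1, 4)
        hx, cx = lstm(obs, (hx, cx))
        loss = hx.sum()       # stand-in for the per-step PPO loss
        loss.backward()       # gradients cover only this step's computation
        # 1-step backprop: cut the graph so the next step starts fresh
        hx, cx = hx.detach(), cx.detach()

With BPTT you would instead keep the hidden states attached over a window of steps and call `backward()` once on the summed loss, which also forces you to store and replay whole sequences in the PPO minibatches.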


I am unable to figure out how PPO should be implemented with an LSTM.

Also, given that the main objective of PPO is to keep each update close to the current policy: is clipping the gradient in A3C to a small value the same thing, and can I expect the same results as with PPO?

May I get help on how to implement this? If there is a correct implementation of PPO + LSTM, could you please share it?
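On the gradient-clipping question: the two are not equivalent. `clip_grad_norm_` bounds the step size in parameter space, while PPO clips the probability ratio between the new and old policy, which bounds how far the action probabilities themselves can move. A minimal sketch of PPO's clipped surrogate loss (function name and signature are illustrative):

    import torch

    def ppo_policy_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
        # PPO clips the probability ratio pi_new / pi_old, not the gradients
        ratio = torch.exp(log_prob_new - log_prob_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
        # take the pessimistic (smaller) objective, then negate for a loss
        return -torch.min(unclipped, clipped).mean()

Gradient-norm clipping in A3C cannot reproduce this behaviour, because a small gradient step can still move the policy's action probabilities arbitrarily far over many updates on the same batch.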