A3c in pytorch.Does normal a3c with multinomial/categorical sampling works well in continuous STATE with discrete action as well?

granth_jain · August 30, 2020, 4:09pm

Hi,

I am trying to develop A3C reinforcement learning in pytorch.

But I am getting same action being taken on multinomial as well as categorical sampling of actions.

Below is my code…I need help in understanding whether simple A3C works well in continuous state space as well.

Thanks,
Granth

    while count<max_timesteps-1:
        episode_length += 1
        if done:
            cx = Variable(torch.zeros(1, params.state_dim))
            hx = Variable(torch.zeros(1, params.state_dim))
        else:
            cx = Variable(cx.data)
            hx = Variable(hx.data)
        values = []
        log_probs = []
        rewards = []
        entropies = []
        while count<max_timesteps-1:
            value, action_values, (hx, cx) = model((Variable(state.unsqueeze(0)), (hx, cx)))
            prob = F.softmax(action_values,dim = -1)
            log_prob = F.log_softmax(action_values, dim=-1)
            entropy = -(log_prob * prob).sum(1, keepdim=True)
            entropies.append(entropy)
            cdist = categorical.Categorical(prob)
            action = cdist.sample()
            log_prob = log_prob[0, Variable(action)].data
            state, reward, done = env.step(action)
            done = (done or count == max_timesteps-2)
            reward = max(min(reward, 1), -1)
            
            
            count +=1
            
            if done:
                episode_length = 0
                state = env.reset()
                
            
            values.append(value)
            log_probs.append(log_prob)
            rewards.append(reward)
            print(ticker," action:",action, "reward ",reward)

            if done:
                break
            
        R = torch.zeros(1, 1)
        if not done:
            value, _, _ = model((Variable(state.unsqueeze(0)), (hx, cx)))
            R = value.data
        values.append(Variable(R))
        policy_loss = 0
        value_loss = 0
        R = Variable(R)
        gae = torch.zeros(1, 1)
        for i in reversed(range(len(rewards))):
            R = params.gamma * R + rewards[i]
            advantage = R - values[i]
            value_loss = value_loss + 0.5 * advantage.pow(2)
            TD = rewards[i] + params.gamma * values[i + 1].data - values[i].data
            gae = gae * params.gamma * params.tau + TD
            policy_loss = policy_loss - log_probs[i] * Variable(gae) - 0.01 * entropies[i]

        optimizer.zero_grad()
        (policy_loss + 0.5 * value_loss).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 40)
        optimizer.step()

iffiX · August 31, 2020, 4:18am

A3C works in continuous space, but its unstable (even more than A2C), and produce poorer results than PPO / IMPALA(distributed)

For correct implementation, consider reference:

github.com

iffiX/machin/blob/b1a8b0ce99be9e4d47e132c1325e1aaedb87e0a4/machin/frame/algorithms/a2c.py#L312


will be cleared after update is finished.

Args:
    update_value: Whether update the Q network.
    update_policy: Whether update the actor network.
    concatenate_samples: Whether concatenate the samples.

Returns:
    mean value of estimated policy value, value loss
"""
sum_act_loss = 0
sum_value_loss = 0
self.actor.train()
self.critic.train()
for _ in range(self.actor_update_times):
    # sample a batch
    batch_size, (state, action, advantage) = \
        self.replay_buffer.sample_batch(self.batch_size,
                                        sample_method="random_unique",
                                        concatenate=concatenate_samples,
                                        sample_attrs=[

github.com

iffiX/machin/blob/b1a8b0ce99be9e4d47e132c1325e1aaedb87e0a4/machin/frame/algorithms/a3c.py#L131


    if self.is_syncing:
        self.critic_grad_server.pull(self.critic)
    return super(A3C, self)._criticize(state)

def update(self,
           update_value=True,
           update_policy=True,
           concatenate_samples=True,
           **__):
    # DOC INHERITED
    org_sync = self.is_syncing
    self.is_syncing = False
    super(A3C, self).update(update_value, update_policy,
                            concatenate_samples)
    self.is_syncing = org_sync
    self.actor_grad_server.push(self.actor)
    self.critic_grad_server.push(self.critic)

granth_jain · August 31, 2020, 3:54pm

Hi,

Thanks for the answer…could you please help to let know if there is a version of ppo with lstm.

iffiX · September 1, 2020, 1:48am

If you just need ppo + lstm with 1-step back brop, there are. But it would be more tricky if you would like to have the BPTT (back-propagate through time feature)

granth_jain · September 13, 2020, 6:11pm

Hi,

I am unable to figure it out how PPO is to be implemented with LSTM.

Also,given the main objective of PPO is to have an update near to the current policy…is it same to clip the gradient in A3C to a small value and then expect the results to be same as PPO??

may I get help in how to implement this and if there a correct implementation of ppo+lstm can you please help to share.

Thanks