Unnormalize rewards of final network

Hey there,

I trained a NN using PPO. My network gives me the action I should take for a given state and the estimated value for that state and action. I trained the network with normalized rewards:

        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)

Questions:

  1. In practice (when using the NN) I only get the normalized estimated value - is there any way to get the true estimated value? I do not have rewards.mean() etc., so I cannot just calculate it as explained here.

  2. What does the estimated value of my value function actually represent? Does it simply rate the current state and action, or does it evaluate the given state and give me a hint about the final score?

  3. What I actually want is a network that predicts the final score for a given state. Can I use the value function to achieve this, or should I use something different -> train a separate NN?

My code is at:

A snippet of my network is:


import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Actor Model:
class ActorModel(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorModel, self).__init__()
        self.a_dim   = action_dim

        self.ac      = nn.Linear(state_dim, n_latent_var)
        self.ac_prelu= nn.PReLU()
        self.ac1      = nn.Linear(n_latent_var, n_latent_var)
        self.ac1_prelu= nn.PReLU()

        # Actor layers:
        self.a1      = nn.Linear(n_latent_var+action_dim, action_dim)

        # Critic layers:
        self.c1      = nn.Linear(n_latent_var, n_latent_var)
        self.c1_prelu= nn.PReLU()
        self.c2      = nn.Linear(n_latent_var, 1)

    def forward(self, input):
        # For 4 players each 15 cards on hand:
        # input=on_table(60)+ on_hand(60)+ played(60)+ play_options(60)+ add_states(15)
        # add_states = color free (4)+ would win (1) = 5  for each player
        #input.shape  = 15*4*4=240+3*5 (add_states) = 255

        #Actor and Critic:
        ac = self.ac(input)
        ac = self.ac_prelu(ac)
        ac = self.ac1(ac)
        ac = self.ac1_prelu(ac)

        # Get Actor Result:
        if len(input.shape)==1:
            options = input[self.a_dim*3:self.a_dim*4]
            actor_out = torch.cat([ac, options], 0)
        else:
            options = input[:, self.a_dim*3:self.a_dim*4]
            actor_out   = torch.cat( [ac, options], 1)
        actor_out = self.a1(actor_out)
        actor_out = actor_out.softmax(dim=-1)

        # Get Critic Result:
        critic = self.c1(ac)
        critic = self.c1_prelu(critic)
        critic = self.c2(critic)

        return actor_out, critic

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()
        self.a_dim   = action_dim

        # actor critic
        self.actor_critic = ActorModel(state_dim, action_dim, n_latent_var)

    def act(self, state, memory):
        if type(state) is np.ndarray:
            state = torch.from_numpy(state).float()
        action_probs, _ = self.actor_critic(state)
        # here make a filter for only possible actions!
        #action_probs = action_probs *state[self.a_dim*3:self.a_dim*4]
        dist = Categorical(action_probs)
        action = dist.sample()  # -> gets the lowest non 0 value?!

        if memory is not None:
            # necessary to convert all to numpy, otherwise deepcopy is not possible!
            memory.states.append(state.data.numpy())
            memory.actions.append(int(action.data.numpy()))
            memory.logprobs.append(float(dist.log_prob(action).data.numpy()))

        return action.item()

    def evaluate(self, state, action):
        action_probs, state_value = self.actor_critic(state)
        dist = Categorical(action_probs)

        action_logprobs = dist.log_prob(action)
        dist_entropy    = dist.entropy()
        return action_logprobs, torch.squeeze(state_value), dist_entropy
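
For context, I use the model roughly like this (n_latent_var=64 and the random dummy state are just placeholders for this snippet; state_dim=255 and action_dim=60 follow from the comments in forward()):

policy = ActorCritic(state_dim=255, action_dim=60, n_latent_var=64)
state  = np.random.rand(255).astype(np.float32)          # dummy state vector
action = policy.act(state, memory=None)                  # sampled action index
_, value = policy.actor_critic(torch.from_numpy(state))  # second output is the critic
print(action, value.item())                              # value is the (normalized) state-value estimate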

Since you are just using the normalized target value calculated by the Monte Carlo return method below:

def monteCarloRewards(self, memory):
    # Monte Carlo estimate of state rewards:
    # see: https://medium.com/@zsalloum/monte-carlo-in-reinforcement-learning-the-easy-way-564c53010511
    rewards = []
    discounted_reward = 0
    for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
        if is_terminal:
            discounted_reward = 0
        discounted_reward = reward + (self.gamma * discounted_reward)
        rewards.append(discounted_reward)
    rewards.reverse()
    # Normalizing the rewards:
    rewards = torch.tensor(rewards)
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
    return rewards

and not generalized advantage estimation (GAE):

Your critic will only work if your network produces both outputs from one state input, i.e. your actor and critic stem from the same root, and you are doing this correctly. Otherwise the critic will not affect the training.
(P.S. I have accidentally separated the actor and the critic before, and it seems that the actor works fine with enough Monte Carlo samples, even without the critic.)

So, answers to your questions:
Q1 and Q2: The critic here just serves as a supplementary gradient source, so if you would like your critic to optimize directly on the (unnormalized) Monte Carlo return, it will work, but it will probably be unstable (see the loss sketch after these answers).
Q3: Of course you can. You can even remove the critic completely (see the P.S. above) and just let your actor optimize on log_prob * target_value, but that would likely be unstable as well.
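
To make the "supplementary gradient source" point concrete, a shared-root PPO update typically mixes the actor and critic terms into one loss, roughly like this (a sketch, not your exact training code; eps_clip and the 0.5 / 0.01 coefficients are common defaults I am assuming, and old_states / old_actions / old_logprobs stand for tensors built from your memory buffer):

# rewards: the (normalized or raw) Monte Carlo returns from monteCarloRewards()
eps_clip = 0.2  # typical PPO clip range

logprobs, state_values, dist_entropy = policy.evaluate(old_states, old_actions)
ratios = torch.exp(logprobs - old_logprobs)

advantages = rewards - state_values.detach()
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages

# actor term + critic term (the "supplementary" gradient) + entropy bonus
loss = -torch.min(surr1, surr2) \
       + 0.5 * nn.MSELoss()(state_values, rewards) \
       - 0.01 * dist_entropy
loss.mean().backward()

If you feed the raw discounted returns in as rewards here, the critic regresses the real score scale; that is the "works, but probably unstable" case above.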

So in general, if you would like to get a “true” estimate of your target value, I would suggest you train a separate network (see the sketch below) and not touch your actor-critic setup here.
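
Such a separate predictor could be as simple as this (a minimal sketch; FinalScoreNet, the layer sizes and the optimizer settings are my own placeholders, and it assumes you log (state, final_score) pairs from finished games):

import torch
import torch.nn as nn

class FinalScoreNet(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.PReLU(),
            nn.Linear(hidden, hidden), nn.PReLU(),
            nn.Linear(hidden, 1),   # raw (unnormalized) final score
        )

    def forward(self, state):
        return self.net(state)

# plain supervised regression on logged games
model = FinalScoreNet(state_dim=255)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# states: (N, 255) tensor, final_scores: (N, 1) tensor collected from finished games
def train_step(states, final_scores):
    optimizer.zero_grad()
    loss = loss_fn(model(states), final_scores)
    loss.backward()
    optimizer.step()
    return loss.item()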

If you are using GAE, then in your implementation you may separate your critic and actor: your critic must optimize directly on the target value, the output of your critic is what GAE consumes, and your actor will optimize on the normalized “advantage” value given by the GAE function. In this case, I would suggest you use the critic to directly give a future value prediction.
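
For reference, GAE is usually computed like this (a generic sketch, not from your repo; gamma/lam are common defaults, and values needs one extra bootstrap entry for the state after the last step):

import torch

def compute_gae(rewards, values, is_terminals, gamma=0.99, lam=0.95):
    # values: list of critic predictions, length len(rewards) + 1 (bootstrap value at the end)
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if is_terminals[t] else 1.0
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages.insert(0, gae)
    advantages = torch.tensor(advantages)
    returns = advantages + torch.tensor(values[:-1])  # unnormalized critic targets
    # only the advantages get normalized; the critic regresses the raw returns
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5)
    return advantages, returns

Note that only the advantages are normalized here; the critic's targets stay on the real value scale, which is what lets you read a future value prediction straight off the critic.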

Am I clear? @ me if you still have any questions.