PyTorch Network Trains, But the Same TensorFlow Network Does Not. Why?

I realize this question might be more for a TensorFlow forum, but I’m having trouble getting answers there and you guys might know.

It’s been a week and I cannot figure out why the PyTorch version trains and the TensorFlow version absolutely fails. As far as I can tell the networks are the same. At this point I know some specific things that are going wrong, but I don’t know why.

Simple Network

Input: 4
L1: (Input)4 -> 128
L2A: (L1)128 -> 1 # Used for Mu
L2B: (L1)128 -> 1 # Used for Sigma
# Use Mu and Sigma to define Normal Distribution
# Use Normal Distribution for Continuous Action Sampling and Training

I’m aware that, due to the nature of PyTorch and TensorFlow, the code looks a bit different, but at the network level and at the update level I think they should both operate the same way.

A Picture is worth 1000 Words


CLUE NUMBER 1: Gradients in Tensorflow Blow Up

Above you can see that the gradients for PyTorch look good, but for Tensorflow they don’t (they blow up). I know there are ways I can use gradient clipping, and other tricks, but I’m hesitant to do that since I did not need to with PyTorch. There must be something different about the network.

CLUE NUMBER 2: Loss function in Tensorflow Goes Negative

Above you can see that the loss function for TensorFlow actually goes negative. This seems to be because the majority of the log_probs are positive, which is technically possible. But in PyTorch the majority of the log_probs are negative. Also, in PyTorch it looks like the loss is increasing, but I believe that is okay: as the episodes get longer, the returns for each step get larger too.
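To make the "technically possible" part concrete: for a continuous action, log_prob is a log density rather than a log probability, so it is positive whenever the density exceeds 1. A minimal sketch in plain Python (the helper name normal_log_prob is mine, not from either framework) evaluates the Gaussian log density at the mean for a few values of sigma:

```python
import math

def normal_log_prob(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    var = sigma ** 2
    return -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

# At the mean, the log density is -log(sigma * sqrt(2*pi)),
# which turns positive once sigma < 1/sqrt(2*pi) (about 0.399).
for sigma in (1.0, 0.5, 0.1, 0.01):
    print(sigma, normal_log_prob(0.0, 0.0, sigma))
```

So mostly positive log_probs on the TensorFlow side would be consistent with a very small sigma, which lines up with Clue 3.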

CLUE NUMBER 3: Sigmas on Tensorflow Start at one and drop to Zero.

If you look at the Sigmas on the Tensorflow graphs, they drop from an average of 1/episode to an average of 0/episode. This seems really important but I don’t know what to make of it.
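One possible reading, offered as a hypothesis rather than a diagnosis: for a Gaussian policy the gradient of the log density with respect to mu is (x - mu) / sigma^2, and with respect to sigma it is ((x - mu)^2 - sigma^2) / sigma^3, so both grow without bound as sigma shrinks. The sketch below (plain Python, the helper name and the sample values are mine) shows how quickly that happens for a fixed offset between the action and the mean:

```python
def grad_log_prob(x, mu, sigma):
    """Analytic gradients of log N(x; mu, sigma^2) w.r.t. mu and sigma."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    return d_mu, d_sigma

# Same action offset, progressively smaller sigma.
for sigma in (1.0, 0.1, 0.01, 0.001):
    d_mu, d_sigma = grad_log_prob(x=0.05, mu=0.0, sigma=sigma)
    print(f"sigma={sigma:<6} dlogp/dmu={d_mu:.3g} dlogp/dsigma={d_sigma:.3g}")
```

If that is what is happening here, the exploding gradients in Clue 1 and the collapsing sigma in Clue 3 may be two views of the same problem.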

Finally, Some Code. Both PyTorch and Tensorflow

I’ve Double-Double-Double checked and hyper-params are the same across the two.

PyTorch (Works. Trains easily within 1000 episodes):

# Code is broken up into a Policy class, an Agent class, and then the script which runs the code.

class Policy(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Policy, self).__init__()
        self.action_space = action_space
        num_outputs = action_space.shape[0]
        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_outputs)
        self.linear2_ = nn.Linear(hidden_size, num_outputs)

    def forward(self, inputs):
        x = inputs
        x = self.linear1(x)
        x = F.relu(x)
        mu = self.linear2(x)
        sigma_sq = self.linear2_(x)
        sigma_sq = F.softplus(sigma_sq)

        return mu, sigma_sq

pi = Variable(torch.FloatTensor([math.pi]))

class Agent:
    def __init__(self, hidden_size, num_inputs, action_space):
        self.action_space = action_space
        self.model = Policy(hidden_size, num_inputs, action_space)
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)

    # probability density of x given a normal distribution
    # defined by mu and sigma
    def normal(self, x, mu, sigma_sq):
        a = (-1*(Variable(x)-mu).pow(2)/(2*sigma_sq)).exp()
        b = 1/(2*sigma_sq*pi.expand_as(sigma_sq)).sqrt()
        return a*b

    def select_action(self, state):
        state = Variable(state)
        mu, sigma_sq = self.model(state)

        # random scalar from normal distribution
        # with mean 0 and std 1
        random_from_normal = torch.randn(1)

        # modulate our normal (mu,sigma) with random_from_normal to pick an action.
        # Note that if x = random_from_normal, then our action is just:
        # mu + sigma * x
        sigma = sigma_sq.sqrt()
        action = (mu + sigma*Variable(random_from_normal)).data

        # calculate the probability density
        prob = self.normal(action, mu, sigma_sq)

        log_prob = prob.log()

        return action, log_prob

    def discount_rewards(self, rewards, gamma):
        stepReturn = torch.zeros(1, 1)
        stepReturns = []
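        # walk backwards so each return accumulates the rewards that follow it
        for i in reversed(range(len(rewards))):
            stepReturn = gamma * stepReturn + rewards[i]
            stepReturns.append(stepReturn)
        return list(reversed(stepReturns))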

    def update_parameters(self, rewards, log_probs, gamma):
        discounted_rewards = self.discount_rewards(rewards, gamma)
        loss = 0
        for i in range(len(rewards)):
            foo = log_probs[i]*Variable(discounted_rewards[i])
            loss = loss + foo[0]
        loss = loss / len(rewards)
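        loss = -loss

        # backpropagate and take one optimizer step per episode
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()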

agent = Agent(args.hidden_size, env.observation_space.shape[0], env.action_space)
reward_sums = []
for i_episode in range(args.num_episodes):
    state = torch.Tensor([env.reset()])
    log_probs = []
    rewards = []
    for t in range(args.num_steps):
        action, log_prob = agent.select_action(state)
        next_state, reward, done, _ = env.step(action.numpy()[0])
        state = torch.Tensor([next_state])
        log_probs.append(log_prob)
        rewards.append(reward)

        if done:
            break

    agent.update_parameters(rewards, log_probs, args.gamma)

TensorFlow Version (Does Not Train)

class Policy:
  def __init__(self, hparams, session):
    self.session = session
    optimizer = tf.train.AdamOptimizer(hparams['learning_rate'])
    self.observations = tf.placeholder(tf.float32, shape=[None, 4], name="observations")
    self.actions = tf.placeholder(tf.float32, name="actions")
    self.returns = tf.placeholder(tf.float32, name="returns")
    normal = self.build_graph(hparams)
    self.action = normal.sample()
    log_probs = normal.log_prob(self.actions)
    loss = -tf.reduce_mean(tf.multiply(log_probs, self.returns))
    self.trainMe = optimizer.minimize(loss)

  def build_graph(self, hparams):
    hidden = tf.contrib.layers.fully_connected(

    mu = tf.contrib.layers.fully_connected(

    sigma_sq = tf.contrib.layers.fully_connected(

    sigma_sq = tf.nn.softplus(sigma_sq)
    sigma = tf.sqrt(sigma_sq)
    flat_sigma = tf.reshape(sigma,[-1])
    flat_mu = tf.reshape(mu,[-1])
    return tf.distributions.Normal(flat_mu, flat_sigma)

  def select_action(self, observation):
    feed = { self.observations: [observation] }
    return self.session.run(self.action, feed_dict=feed)

  def update_parameters(self, observations, actions, returns, ep_index):
    feed = {
      self.observations: observations,
      self.actions: actions,
      self.returns: returns,
    }
    self.session.run(self.trainMe, feed_dict=feed)

class Agent:
  def __init__(self, hparams):
    self.hparams = hparams
    self.env = gym.make('RoboschoolInvertedPendulum-v2')

  def run(self):
    with tf.Graph().as_default(), tf.Session() as session:
      policy = Policy(self.hparams, session)

      for ep_index in range(self.hparams['max_episodes']):
        observations, actions, rewards = self.policy_rollout(policy)
        returns = self.discount_rewards(rewards)
        policy.update_parameters(observations, actions, returns, ep_index)

  def policy_rollout(self, policy):
    observation, reward, done = self.env.reset(), 0, False
    observations, actions, rewards  = [], [], []

    while not done:
      action = policy.select_action(observation)
      observations.append(observation)
      actions.append(action)
      observation, reward, done, _ = self.env.step(action)
      rewards.append(reward)

    return observations, actions, rewards

  def discount_rewards(self, rewards):
    discounted_rewards = np.zeros_like(rewards)
    running_add = 0
    for t in reversed(range(0, len(rewards))):
      running_add = running_add * self.hparams['gamma'] + rewards[t]
      discounted_rewards[t] = running_add
    return discounted_rewards

hparams = {
  'max_episodes': 5000,
  'gamma': 0.99,
  'hidden_size': 128,
  'learning_rate': 0.001
}

agent = Agent(hparams)
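As a sanity check on the return computation that both versions rely on, here is a small standalone sketch of discounted returns in plain Python (the function name and the sample rewards are illustrative, not taken from either script). Each entry is the reward at that step plus gamma times the return of the following step, so the loop has to run backwards:

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

One detail worth checking in the NumPy version: np.zeros_like(rewards) inherits the dtype of rewards, so if the rewards happened to be integers the discounted values would be silently truncated.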

EDIT Added Mu and Sigma to graphs above. I think they help tell the story of what is wrong. Now we just need to figure out WHY :)

EDIT I’m aware that the initialization for the two networks is slightly different, primarily that the bias in TF is initialized to zero, while in PyTorch it is sampled uniformly from a small range around zero. I updated TF to do the same thing and it did not help.

At a quick glance, it looks like the TensorFlow model updates after 5000 episodes, which is a long time between updates.