PyTorch 0.2.0 reinforce and PyTorch 1.0.1 alternative are giving different results

Lopa · April 12, 2019, 11:01pm

I am trying to train a controller with a policy gradient reinforcement learning approach, which iteratively updates the policy based on sampled estimates of the reward. I have two versions of the code: one for PyTorch 0.2.0 that uses action.reinforce(reward), and one for PyTorch 1.0.1 that uses the replacement recommended in the error message quoted below.

The gist of the code for both PyTorch versions is provided below. They give me quite different results. The PyTorch 1.0.1 version converges within 10 epochs, but to a worse local minimum than the PyTorch 0.2.0 version. The final PyTorch 0.2.0 result is better but has slightly higher variance over the N rollouts than the PyTorch 1.0.1 result. The PyTorch 0.2.0 code also takes around 5 times longer to run.

I am wondering whether I am converting between the PyTorch versions incorrectly, or whether others have also noticed such differences in results and this is a cause for concern. Thank you.

For reference, this is the error message that recommends the modification:

RuntimeError: reinforce() was removed.
Use torch.distributions instead.
See https://pytorch.org/docs/master/distributions.html

Instead of:

probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()

Use:

probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
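
As a minimal sketch of the recommended pattern applied to a 51-dimensional action, the snippet below checks the tensor shapes involved. The random logits stand in for the controller output, and the choice of 4 options per decision and the reward value of 1.5 are assumptions made only for illustration. Since log_prob returns one value per decision, it is summed to a scalar before backward() is called.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the controller output:
# 51 decisions, each over 4 choices (the 4 is an assumption).
logits = torch.randn(51, 4, requires_grad=True)
probs = torch.softmax(logits, dim=-1)

m = torch.distributions.Categorical(probs)
action = m.sample()              # shape (51,): one sampled index per decision
log_prob = m.log_prob(action)    # shape (51,): one log-probability per decision

reward = 1.5                     # made-up scalar reward for the whole rollout
loss = -(log_prob.sum() * reward)  # reduce to a scalar surrogate loss
loss.backward()                  # gradients flow back to the logits
```

After this runs, logits.grad has the same shape as logits, which is what an optimizer step would consume.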

The gist of the PyTorch 0.2.0 code is:

# predict actions for N rollouts, where each action is a 51 dimensional vector
sumR = 0
for e in epochs:
    actions, rewards = [], []
    for _ in range(N):
        probs = controller(input)
        action = probs.multinomial()    # each action is a 51 dimensional vector
        
        # calculate reward
        reward = calculateReward(action)
        
        actions.append(action)
        rewards.append(reward)
    
    avgR = mean(rewards)
    b = sumR/ (e + 1)
    sumR += avgR
    
    # update controller
    for action in actions:
        action.reinforce(avgR - b)
        optimizer.zero_grad()
        autograd.backward(action, [None for _ in action])
        optimizer.step()

The gist of the PyTorch 1.0.1 code is:

# predict actions for N rollouts, where each action is a 51 dimensional vector
sumR = 0
for e in epochs:
    logProbs, actions, rewards = [], [], []
    for _ in range(N):
        probs = controller(input)
        m = torch.distributions.Categorical(probs)
        action = m.sample()    # each action is a 51 dimensional vector
        
        # calculate reward
        reward = calculateReward(action)
        
        logProbs.append(m.log_prob(action))    # each logProb is a 51 dimensional vector
        actions.append(action)
        rewards.append(reward)
    
    avgR = mean(rewards)
    b = sumR/ (e + 1)
    sumR += avgR
    
    # update controller
    for action, logProb in zip(actions, logProbs):
        loss = - logProb.sum() * (avgR - b)
        optimizer.zero_grad()
        loss.backward(retain_graph=True)
        optimizer.step()
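
Both gists share the same baseline bookkeeping, so it may help to trace it in isolation. The sketch below uses plain Python with made-up rewards and assumes that e is an integer epoch index starting at 0. The function name baseline_advantages is hypothetical and does not appear in the original code.

```python
from statistics import mean

def baseline_advantages(epoch_rewards):
    """Return (avgR, b, avgR - b) per epoch, in the same order as the gists."""
    sum_r = 0.0
    out = []
    for e, rewards in enumerate(epoch_rewards):
        avg_r = mean(rewards)
        # At this point sum_r holds the averages of the e previous epochs,
        # and it is divided by e + 1, exactly as in the gists.
        b = sum_r / (e + 1)
        sum_r += avg_r
        out.append((avg_r, b, avg_r - b))
    return out

# Made-up rewards for three epochs with N = 2 rollouts each.
history = baseline_advantages([[1.0, 3.0], [2.0, 4.0], [4.0, 6.0]])
for avg_r, b, adv in history:
    print(f"avgR={avg_r:.3f}  b={b:.3f}  avgR-b={adv:.3f}")
```

In the first epoch the baseline is 0, so the scaling factor equals avgR itself, and every rollout within an epoch is scaled by the same value avgR - b.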