What's the right way of implementing policy gradient?

alexis-jacq · June 14, 2017, 8:22am

I think you should look at the topic that explains the tensor.reinforce method.

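If the action was produced by a stochastic sampling operation, calling action.reinforce(r) registers r as the reward for that sample, so the following backward pass computes a policy-gradient (REINFORCE) update.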
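For readers on later PyTorch releases, where .reinforce() is no longer available, the same idea is usually written with torch.distributions (which the linked example also imports). The following is only a minimal sketch under assumptions: the tiny linear policy, the random state, and the fixed reward of 1.0 are hypothetical stand-ins, not taken from the linked file.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical tiny policy: 4-dimensional state -> logits over 2 actions.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(1, 4)            # stand-in for an environment observation
dist = Categorical(logits=policy(state))
action = dist.sample()               # sampled (stochastic) action, shape (1,)
log_prob = dist.log_prob(action)     # log pi(action | state), shape (1,)
prob_before = log_prob.exp().item()

reward = 1.0                         # placeholder return for this action

# Surrogate loss: its gradient is -reward * grad log pi(action | state),
# which plays the role that action.reinforce(reward) played.
loss = (-log_prob * reward).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Probability of the same action under the updated policy.
prob_after = Categorical(logits=policy(state)).log_prob(action).exp().item()
```

With a positive reward and a small learning rate, the probability of the sampled action should go up after the step; a negative reward would push it down.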
A full example implementation is here:

github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py

import argparse
import gym
import numpy as np
from itertools import count
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical


parser = argparse.ArgumentParser(description='PyTorch REINFORCE example')
parser.add_argument('--gamma', type=float, default=0.99, metavar='G',
                    help='discount factor (default: 0.99)')
parser.add_argument('--seed', type=int, default=543, metavar='N',
                    help='random seed (default: 543)')
parser.add_argument('--render', action='store_true',
                    help='render the environment')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',

This file has been truncated.
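The --gamma flag in the preview above is the discount factor. As a rough illustration of how such a factor is typically used in REINFORCE, here is a small sketch that turns per-step rewards into discounted returns. The function name discounted_returns is hypothetical and this is not a copy of the truncated file.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

Each return would then weight the log-probability of the action taken at that step when building the policy-gradient loss.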