Computing loss to maximize reward

I am trying to learn a policy, represented by a neural network, that maximizes the reward I obtain from a simulated environment. I am using tanh activation functions, but any would work. I am fairly new to reinforcement learning and particularly to PyTorch. I would like an actor-only approach over a continuous state space, hence the parametrized policy.

I have the following network structure:

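import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.affine1 = nn.Linear(3, 6)   # 3 state features in
        self.affine2 = nn.Linear(6, 6)
        self.affine3 = nn.Linear(6, 1)   # 1 action value out

        self.rewards = []                # rewards collected during an episode

    def forward(self, x):
        # torch.tanh replaces the deprecated F.tanh
        action = torch.tanh(self.affine1(x)) # maybe change to linear
        action = torch.tanh(self.affine2(action))
        # affine2 is applied a second time here (same weights)
        action = torch.tanh(self.affine2(action))
        return self.affine3(action)

policy = PolicyNetwork()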

My environment computes a scalar reward (numpy float64) at every iteration, which I store in policy.rewards.

Then, at the end of each episode, I do:

criterion = nn.MSELoss()
base_line=1.0 # this is unreachable given my environment - I know maximum is 0.6

def finish_episode(gamma,base_line):
    R = 0
    rewards = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        rewards.insert(0, R)
    rewards = torch.tensor(rewards)
    for reward in rewards:
        loss += criterion(reward, base_line)
    del policy.rewards[:]

Calling this raises the following error: AttributeError: 'float' object has no attribute 'requires_grad'

I have also tried:

base_line=Variable(Variable(torch.ones(2), requires_grad=True))

but then I get the error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
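
Both errors may share a cause: the discounted returns are built from numpy rewards, so they carry no autograd history back to the network's parameters. A minimal sketch of that, assuming made-up rewards and a hypothetical helper `discounted_returns` that mirrors the loop in finish_episode:

```python
import torch
import torch.nn as nn


def discounted_returns(rewards, gamma):
    """Hypothetical helper: reward-to-go, same recurrence as in finish_episode."""
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)


# Made-up rewards standing in for what the environment returns.
returns = discounted_returns([0.1, 0.2, 0.3], gamma=0.5)

# Wrapping the baseline in a tensor gives MSELoss two tensors to compare.
base_line = torch.full_like(returns, 1.0)
loss = nn.MSELoss()(returns, base_line)

# Neither the returns nor the loss is connected to any network parameter,
# so there is nothing for loss.backward() to differentiate.
print(returns.requires_grad, loss.requires_grad)
```

Here the loss is a plain number computed from constants, which is consistent with the "does not require grad" message: a loss has to depend on the network's output before it can be backpropagated.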

I guess there is a simple way to do this, but I have just not found it. Any suggestions would be greatly appreciated.
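
One possible direction, as a sketch rather than a definitive answer: make the loss depend on the network's output through the log-probability of the sampled action, as in the score-function (REINFORCE) estimator. Everything below is an assumption that is not in the original code: the Gaussian policy with a fixed standard deviation `sigma`, the helper `reinforce_loss`, the constant baseline subtracted from the returns, the small stand-in network, and the random states and rewards.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

torch.manual_seed(0)

# Stand-in for PolicyNetwork: 3 state features in, 1 action mean out.
policy = nn.Sequential(nn.Linear(3, 6), nn.Tanh(), nn.Linear(6, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
sigma = 0.1  # assumed fixed exploration noise


def reinforce_loss(log_probs, rewards, gamma, base_line=0.0):
    """Surrogate loss whose gradient is the negative policy-gradient estimate."""
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Returns are constants; gradients flow only through log_probs.
    return -(torch.stack(log_probs) * (returns - base_line)).sum()


log_probs, rewards = [], []
for _ in range(5):  # one made-up episode of 5 steps
    state = torch.randn(3)                # placeholder for the environment state
    dist = Normal(policy(state), sigma)   # network output is the action mean
    action = dist.sample()                # sampling itself is not differentiated
    log_probs.append(dist.log_prob(action).squeeze())
    rewards.append(float(torch.rand(1)))  # placeholder for the environment reward

loss = reinforce_loss(log_probs, rewards, gamma=0.99)
optimizer.zero_grad()
loss.backward()   # works because log_probs depend on the policy parameters
optimizer.step()
```

In a real loop the placeholder state and reward would come from the environment, and the sampled action would be the one passed to it.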