# Computing loss to maximize reward

I am trying to learn a policy with a neural network (I am using tanh activations, but any would work) by maximizing the reward I obtain from my environment. I am fairly new to reinforcement learning and particularly to PyTorch. I would like to create an actor-only strategy that maximizes my rewards (which I get from a simulated environment) over a continuous state space, hence the parametrized policy.

I have the following network structure:

```
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.affine1 = nn.Linear(3, 6)
        self.affine2 = nn.Linear(6, 6)
        self.affine3 = nn.Linear(6, 1)

        # Rewards collected from the environment over one episode.
        self.rewards = []

    def forward(self, x):
        action = torch.tanh(self.affine1(x))  # maybe change to linear
        action = torch.tanh(self.affine2(action))
        return self.affine3(action)

policy = PolicyNetwork()
```
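
For reference, this is how I call it on a single state (the values below are just placeholders; the real state comes from my simulator):

```
state = torch.tensor([0.1, -0.2, 0.05])  # hypothetical 3-dimensional state
action = policy(state)                   # unbounded scalar from the final linear layer
```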

My environment computes a scalar reward (a NumPy float64) at every iteration, and I do the following:

```
policy.rewards.append(reward)
```
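
(Each `reward` here is just a plain NumPy scalar, for example:)

```
import numpy as np

reward = np.float64(0.3)  # placeholder value; real rewards come from the simulator
policy.rewards.append(reward)
```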

Then, at the end of each episode I do:

```
finish_episode(gamma, base_line)
```

with

```
gamma = 0.99
criterion = nn.MSELoss()
base_line = 1.0  # this is unreachable given my environment - I know the maximum is 0.6

def finish_episode(gamma, base_line):
    # Accumulate discounted returns, newest-to-oldest.
    R = 0
    rewards = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        rewards.insert(0, R)
    rewards = torch.tensor(rewards)
    loss = 0
    for reward in rewards:
        loss += criterion(reward, base_line)  # this is where it fails
    loss.backward()
    optimizer.step()  # optimizer is defined elsewhere in my script
    del policy.rewards[:]
```

and I get the following error: `AttributeError: 'float' object has no attribute 'requires_grad'`

I have also tried:

```
from torch.autograd import Variable

base_line = Variable(torch.ones(2), requires_grad=True)
```

and get the error `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`.
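
I suspect the deeper issue is that `torch.tensor(rewards)` builds a fresh tensor from plain floats, so nothing connects the loss back to the network parameters. Checking in isolation:

```
rewards = torch.tensor([0.5, 0.4])  # built from plain floats
print(rewards.requires_grad)        # False
print(rewards.grad_fn)              # None - no path back to the network
```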

I guess there is a simple way to do this, but I just have not found it. Any suggestions would be greatly appreciated.
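
For context, the closest pattern I have found is REINFORCE, where the loss is built from the stored log-probabilities of the sampled actions rather than from the rewards themselves, so gradients can flow back to the network. The sketch below is my own guess at how that would look here (the Gaussian head, `select_action`, and all names are mine, not working code from my project):

```
import torch
import torch.nn as nn
import torch.optim as optim

class GaussianPolicy(nn.Module):
    """Outputs the mean of a Gaussian over actions for a 3-dimensional state."""
    def __init__(self):
        super(GaussianPolicy, self).__init__()
        self.affine1 = nn.Linear(3, 6)
        self.affine2 = nn.Linear(6, 1)
        self.log_std = nn.Parameter(torch.zeros(1))  # learned log standard deviation

    def forward(self, state):
        h = torch.tanh(self.affine1(state))
        return self.affine2(h)  # mean of the action distribution

policy = GaussianPolicy()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
log_probs, rewards = [], []

def select_action(state):
    mean = policy(state)
    dist = torch.distributions.Normal(mean, policy.log_std.exp())
    action = dist.sample()
    log_probs.append(dist.log_prob(action))  # stays attached to the graph
    return action

def finish_episode(gamma=0.99, base_line=0.0):
    # Discounted returns, as in my code above.
    R, returns = 0, []
    for r in rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    # REINFORCE loss: minimizing it performs gradient ascent on expected return.
    loss = torch.stack([-lp * (R - base_line)
                        for lp, R in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    del rewards[:], log_probs[:]
```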