I am trying to learn a policy (a neural network; I am using tanh activations, but any would work) by maximizing the reward I obtain from my environment. I am fairly new to reinforcement learning and to PyTorch in particular. I would like to build an actor-only strategy that maximizes the rewards I get from a simulated environment over a continuous state space (hence the parametrized policy).
I have the following network structure:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.affine1 = nn.Linear(3, 6)
        self.affine2 = nn.Linear(6, 6)
        self.affine3 = nn.Linear(6, 1)
        self.rewards = []  # one scalar reward per simulation step

    def forward(self, x):
        action = F.tanh(self.affine1(x))  # maybe change to linear
        action = F.tanh(self.affine2(action))
        return self.affine3(action)

policy = PolicyNetwork()
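So the network maps a 3-dimensional state to a single continuous action. A quick sanity check (the random state here is just a dummy input):

state = torch.randn(3)   # dummy 3-dimensional state
action = policy(state)   # tensor of shape (1,)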
My environment returns a scalar reward (a numpy float64) at every iteration, which I collect with:
policy.rewards.append(reward)
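For context, one step of my simulation loop looks roughly like this (env_step is just a placeholder name for my simulator, which consumes the action and returns the next state and the reward):

action = policy(torch.as_tensor(state, dtype=torch.float32))  # state is the 3-dim observation
state, reward = env_step(action.detach().numpy())             # reward is a numpy float64
policy.rewards.append(reward)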
Then, at the end of each episode I do:
finish_episode(gamma, base_line)
with
gamma = 0.99
criterion = nn.MSELoss()
base_line = 1.0  # unreachable given my environment - I know the maximum is 0.6
def finish_episode(gamma, base_line):
    R = 0
    rewards = []
    # discounted return, computed backwards over the episode
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        rewards.insert(0, R)
    rewards = torch.tensor(rewards)
    # penalize the squared distance of each return from the baseline
    loss = 0
    for reward in rewards:
        loss += criterion(reward, base_line)
    optimizer.zero_grad()  # optimizer is built over policy.parameters() elsewhere
    loss.backward()
    optimizer.step()
    del policy.rewards[:]  # clear the reward buffer for the next episode
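The first loop walks the rewards backwards to build the discounted return R_t = r_t + gamma * R_{t+1} for every step of the episode; the second loop is where I try to push each return towards the baseline.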
When I run this, I get the following error:

AttributeError: 'float' object has no attribute 'requires_grad'
I have also tried:
from torch.autograd import Variable

base_line = Variable(torch.ones(2), requires_grad=True)
base_line = base_line[0]
and get the error "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".
I guess there is a simple way to do this, but I just have not found it. Any suggestions would be greatly appreciated.