I am using the OpenAI Gym CartPole-v0 environment. Here is the code that doesn't work. (`optimizer.zero_grad()` and `optimizer.step()` are called outside the function.)

```
def make_step(model, optimizer, criterion, observation, action, reward, next_observation):
    inp = torch.from_numpy(observation)
    target = model(torch.from_numpy(observation)).detach().numpy()
    next_target = model(torch.from_numpy(next_observation)).detach().numpy()
    new_reward = np.max(next_target)
    target[action] = reward
    target[action] += new_reward
    obv_reward = model(inp.double())
    target_reward = torch.from_numpy(target)
    loss = criterion(obv_reward, target_reward)
    loss.backward()
```

When I run this code, the agent learns nothing and never achieves a reward above 10.

If I instead move the next-state (gamma) term to the left-hand side, so that it is subtracted from the prediction, and remove the network's foresight from the target, it does slightly better, achieving a reward of around 30-120.

```
def make_step(model, optimizer, criterion, observation, action, reward, next_observation):
    inp = torch.from_numpy(observation)
    target = model(torch.from_numpy(observation)).detach().numpy()
    # next_target = model(torch.from_numpy(next_observation)).detach().numpy()
    # new_reward = np.max(next_target)
    target[action] = reward
    # target[action] += new_reward
    obv_reward = model(inp.double()) - model(torch.from_numpy(next_observation))
    target_reward = torch.from_numpy(target)
    loss = criterion(obv_reward, target_reward)
    loss.backward()
```
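For comparison, here is a minimal sketch of the textbook one-step Q-learning target, written with NumPy only. The first version above adds the undiscounted `np.max(next_target)` and treats every step the same way, whereas the usual formulation scales that term by a discount factor and drops it on the final step of an episode. The function name `td_target`, the `done` flag, and the default `gamma=0.99` are assumptions for illustration and do not appear in the code above; whether this difference explains the behaviour is not something the sketch can confirm.

```python
import numpy as np


def td_target(q_values, next_q_values, action, reward, done, gamma=0.99):
    """Return a copy of q_values whose entry for `action` is replaced
    by the one-step Q-learning target (hypothetical helper)."""
    target = np.array(q_values, dtype=np.float64, copy=True)
    # Bootstrap from the best next-state value, discounted by gamma.
    # On a terminal transition there is no next state, so the term is 0.
    bootstrap = 0.0 if done else gamma * float(np.max(next_q_values))
    target[action] = reward + bootstrap
    return target


# Made-up Q-values for a two-action environment such as CartPole.
q = np.array([1.0, 2.0])
next_q = np.array([3.0, 5.0])

# Non-terminal step: entry 0 becomes 1.0 + 0.5 * 5.0 = 3.5.
print(td_target(q, next_q, action=0, reward=1.0, done=False, gamma=0.5))

# Terminal step: entry 0 becomes the reward alone, 1.0.
print(td_target(q, next_q, action=0, reward=1.0, done=True, gamma=0.5))
```

In `make_step`, the returned array would take the place of `target` before it is converted with `torch.from_numpy`, assuming the environment's `done` flag is passed in as an extra argument.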

Why doesn't the first version work, and how do I fix it?