I have the following equation:

```
w <- w + alpha * [R + gamma * q(S', A', w) - q(S, A, w)] * grad of q(S, A, w)
```
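My reading of it, splitting the bracketed term out as the TD error δ:

```
δ = R + γ·q(S', A', w) − q(S, A, w)
w ← w + α·δ·∇_w q(S, A, w)
```

where, as far as I understand, the gradient is taken only of q(S, A, w), not of the target term.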

My code:

```
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import backward, Variable
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(5, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        x = F.tanh(self.fc1(x))
        return F.sigmoid(self.fc2(x))
env = gym.make('CartPole-v0')
policy = Policy()
optimizer = optim.Adam(policy.parameters())
s = env.reset()
a = 0
s_next, reward, done, _ = env.step(a)
a_next = 1
x = torch.from_numpy(np.append(s, a)).float().unsqueeze(0)
x = Variable(x)
q_sa = policy(x)
x_next= torch.from_numpy(np.append(s_next, a_next)).float().unsqueeze(0)
x_next = Variable(x_next)
q_sa_next = policy(x_next)
alpha = 1
gamma = 1
update = alpha * (reward + gamma * q_sa_next - q_sa)
optimizer.zero_grad()
update.backward()
optimizer.step()
```
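For comparison, here is a sketch of how I *think* the gradient part could be handled: treat the bootstrap target `R + gamma * q(S', A', w)` as a constant (via `torch.no_grad()`), minimize a squared TD error, and let the optimizer's learning rate play the role of alpha. I've used a dummy transition instead of `env.step()` so `gym` isn't needed to run it:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Same shape of network as above: 4 state dims + 1 action dim -> Q-value.
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(5, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        return torch.sigmoid(self.fc2(x))

policy = Policy()
optimizer = optim.Adam(policy.parameters())  # lr plays the role of alpha

# Dummy transition (S, A, R, S', A') standing in for env.step().
s = np.random.rand(4).astype(np.float32)
s_next = np.random.rand(4).astype(np.float32)
a, a_next, reward, gamma = 0, 1, 1.0, 1.0

x = torch.from_numpy(np.append(s, a)).float().unsqueeze(0)
x_next = torch.from_numpy(np.append(s_next, a_next)).float().unsqueeze(0)

q_sa = policy(x)

# Treat the bootstrap target as a constant: no gradient flows
# through q(S', A', w) (the "semi-gradient" part).
with torch.no_grad():
    target = reward + gamma * policy(x_next)

# d/dw [0.5 * (target - q_sa)^2] = -delta * grad q(S, A, w),
# so a gradient-descent step moves w by +alpha * delta * grad q(S, A, w).
loss = 0.5 * (target - q_sa).pow(2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

My understanding is that calling `.backward()` on the raw TD error, as in my code above, would also push gradients through `q_sa_next`, which is why I detach the target here, but I'm not sure this is right.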

Am I computing `update` and backpropagating it correctly? I don't quite understand how to handle the gradient part of the equation.