I set up a simple deep Q-learning task, but it wasn’t learning, so I reduced the number of steps and samples to 1 to track down the error.

The TD errors are computed correctly, but regardless of whether they are positive or negative, the corresponding Q-value increases after the update. I use:

```python
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3, amsgrad=True)
gamma = 0.9

for episode in range(episodes):
    TD = []
    …
    for step in range(run_length):
        …
        TDstep = reward + gamma * torch.max(Qnext.detach(), dim=1).values - Q[range(batch_size), action]
        TD.append(TDstep)

    TD = torch.stack(TD).sum()
    TD.backward()
    optimiser.step()
    optimiser.zero_grad()
```
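To make the behaviour reproducible outside my training loop, here is a minimal self-contained sketch of the same update step. The tiny linear model, the fixed state, and the reward value are stand-ins I chose for illustration, not my actual setup:

```python
import torch

torch.manual_seed(0)

# Stand-in Q-network: 1 state feature -> 5 actions (hypothetical).
model = torch.nn.Linear(1, 5)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3, amsgrad=True)
gamma = 0.9

state = torch.ones(1, 1)       # batch_size = 1
action = torch.tensor([2])
reward = torch.tensor(-10.0)   # large negative reward so the TD error is clearly negative

Q = model(state)               # Q-values for the current state
Qnext = model(state)           # next-state Q-values (same state here, just to illustrate)

# TD error exactly as in my loop
TD = reward + gamma * torch.max(Qnext.detach(), dim=1).values - Q[range(1), action]

q_before = Q[0, action.item()].item()
TD.sum().backward()
optimiser.step()
optimiser.zero_grad()
q_after = model(state)[0, action.item()].item()

print(TD.item(), q_before, q_after)
```

Running this shows the same thing as the output below: the TD error is negative, yet the Q-value for the chosen action still goes up after the step.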

With example output:

```
action taken: 2
Q-values: tensor([-0.0832, -0.5799, 0.6861, -0.1725, -0.0538])
reward: tensor(-0.0050)
max Q(s+1): tensor([0.6873])
TD: tensor([-0.0726], grad_fn=)

action taken: 4
Q-values: tensor([-0.0823, -0.5819, 0.6942, -0.1716, -0.0544])
reward: tensor(-0.1900)
max Q(s+1): tensor([0.6797])
TD: tensor([0.4761], grad_fn=)

action taken: 2
Q-values: tensor([-0.0809, -0.5839, 0.7002, -0.1697, -0.0506])
reward: tensor(-0.0050)
max Q(s+1): tensor([0.6873])
TD: tensor([-0.0867], grad_fn=)

action taken: 1
Q-values: tensor([-0.0796, -0.5859, 0.7069, -0.1680, -0.0477])
```

As you can see, even though the TD error for action 2 in the first step is negative, the Q-value for action 2 increased after the update. The TD error for action 4 is positive, and the Q-value for action 4 also increases after the corresponding update.

I checked the updated weights, and only the expected ones are affected. So I’m clueless.
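The weight check I mention can be sketched like this (again with a hypothetical single linear layer standing in for my network):

```python
import torch

# Hypothetical final layer: 1 state feature -> 5 actions.
layer = torch.nn.Linear(1, 5)
state = torch.ones(1, 1)
action = torch.tensor([2])

Q = layer(state)
# Select only the chosen action's Q-value, as in the TD computation above.
loss = Q[range(1), action].sum()
loss.backward()

# Only the weight row (and bias entry) for the chosen action receives a gradient.
rows_with_grad = (layer.weight.grad.abs().sum(dim=1) != 0).nonzero().flatten()
print(rows_with_grad)  # tensor([2])
```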