Q-values increase even when the TD error is negative

I set up a simple deep Q-learning (DQL) task, but it wasn’t learning, so I reduced the number of steps and samples to 1 to track down the error.

The TD values are calculated correctly, but regardless of whether they are positive or negative, the corresponding Q-value increases after the update. Here is the relevant part of my code:

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3, amsgrad=True)
gamma = 0.9

for episode in range(episodes):
    TD = []

    for step in range(run_length):
        # (environment step and forward passes that produce reward, action,
        #  Q = Q(s) and Qnext = Q(s') are omitted here)
        TDstep = reward + gamma*torch.max(Qnext.detach(), dim=1).values - Q[range(batch_size), action]
        TD.append(TDstep)

    TD = torch.stack(TD).sum()
    TD.backward()

    optimiser.step()
    optimiser.zero_grad()
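
In other words, each TDstep is the one-step TD error, and the quantity I call backward() on at the end of the run is the plain (unsquared) sum of these terms:

\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t), \qquad \text{loss} = \sum_t \delta_t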

With example output:

action taken: 2
Q-values: tensor([-0.0832, -0.5799, 0.6861, -0.1725, -0.0538])
reward: tensor(-0.0050)
max Q(s+1): tensor([0.6873])
TD: tensor([-0.0726], grad_fn=)
action taken: 4
Q-values: tensor([-0.0823, -0.5819, 0.6942, -0.1716, -0.0544])
reward: tensor(-0.1900)
max Q(s+1): tensor([0.6797])
TD: tensor([0.4761], grad_fn=)
action taken: 2
Q-values: tensor([-0.0809, -0.5839, 0.7002, -0.1697, -0.0506])
reward: tensor(-0.0050)
max Q(s+1): tensor([0.6873])
TD: tensor([-0.0867], grad_fn=)
action taken: 1
Q-values: tensor([-0.0796, -0.5859, 0.7069, -0.1680, -0.0477])

As you can see, even though the TD for action 2 in the first step is negative (-0.0726), the Q-value for action 2 increases after the update (0.6861 → 0.6942). The TD for action 4 is positive, and the Q-value for action 4 also increases after the corresponding update (-0.0544 → -0.0506).
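
Plugging the printed numbers back into the target confirms (up to rounding of the displayed values) that the TD itself is computed as intended:

-0.0050 + 0.9 * 0.6873 - 0.6861 ≈ -0.0726   (step 1, action 2)
-0.1900 + 0.9 * 0.6797 - (-0.0544) ≈ 0.4761   (step 2, action 4)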

I also checked the updated weights, and only the expected weights are affected. So I’m clueless.
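
For reference, the before/after comparison I mean looks roughly like this (a sketch rather than my exact code; it reuses model, optimiser and TD from the loop above, and before/delta are just local names):

before = {name: p.detach().clone() for name, p in model.named_parameters()}  # snapshot before the update

TD.backward()
optimiser.step()
optimiser.zero_grad()

# print only the parameters that actually changed
for name, p in model.named_parameters():
    delta = (p.detach() - before[name]).abs().max().item()
    if delta > 0:
        print(f"{name}: max |change| = {delta:.3e}")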

I have a similar problem, too bad nobody answered.
Did you find your error?