I am currently trying to replicate the DQN tutorial in my own way; however, as the DQN runs, the loss doesn't change at all, so I believe the network isn't actually being optimized. From looking at other similar questions, I suspect the problem is in how I constructed the computational graph, but I don't know exactly how to fix it in this particular case.
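For context, the surrounding setup looks roughly like the sketch below; the 4-dimensional state, two discrete actions, and the RMSprop optimizer here are placeholders to make the snippet self-contained, not necessarily the exact configuration:

import random
from collections import deque

import torch.nn as nn
import torch.optim as optim

batch_size = 128
gamma = 0.999

# Placeholder network: 4-dimensional states in, two discrete actions out
model = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
optimizer = optim.RMSprop(model.parameters())

# Replay memory of (state, action, reward, next_state) tuples
D = deque(maxlen=10000)

The optimization step itself, where I think the problem is, looks like this: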
import random

import torch
from torch import FloatTensor
from torch.autograd import Variable


def optimize_model():
    # Sample a random minibatch of transitions from the replay memory D
    batch = random.sample(D, batch_size)
    state_batch = []
    action_batch = []
    reward_batch = []
    next_state_batch = []
    for state, action, reward, next_state in batch:
        state_batch.append(state)
        action_batch.append(action)
        reward_batch.append(reward)
        next_state_batch.append(next_state)

    # Q(s, a): pick out the Q-value of the action actually taken in each state
    q_values = model(Variable(FloatTensor(state_batch)))
    if len(D) < 128:
        # gather with an index of shape (1, batch)
        q_values = q_values.gather(1, Variable(torch.LongTensor([action_batch])))
    else:
        # gather with the index transposed to shape (batch, 1)
        q_values = q_values.gather(1, Variable(torch.t(torch.LongTensor([action_batch]))))

    # max_a' Q(s', a'): forward pass marked volatile so no graph is built,
    # then the volatile flag is cleared so the result can be used in the loss
    next_state_values = model(Variable(FloatTensor(next_state_batch), volatile=True)).max(1)[0]
    next_state_values.volatile = False
    expected_state_action_values = (next_state_values * gamma) + Variable(FloatTensor(reward_batch))

    # Huber loss between predicted Q-values and the bootstrapped targets
    loss = torch.nn.functional.smooth_l1_loss(q_values, expected_state_action_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
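To back up the claim that nothing is being optimized, one way to check is to print the gradient norms of the parameters right after loss.backward(); a minimal diagnostic sketch, assuming model is the same network used in optimize_model():

# Diagnostic sketch: print the gradient norm of every parameter after
# loss.backward() to see whether any learning signal reaches the network.
def print_grad_norms(model):
    for name, param in model.named_parameters():
        if param.grad is None:
            print(name, "-> no gradient at all")
        else:
            print(name, "->", param.grad.data.norm())

If every parameter shows no gradient or a norm of exactly zero across iterations, the loss has been disconnected from the model parameters somewhere in the graph, which is what I suspect is happening here.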