Rewards decreasing in DQN (multiple simultaneous actions)

Hi,
I’m trying to use a DQN for a reinforcement learning problem, but I think I’m doing something wrong in the optimization part of the model. I’ve tried modifying the state representation and the reward function, but the problem remains, and it’s the following.
The rewards always drop more or less linearly (with some noise) over the episodes, so I think I’ve made an error in this code snippet:

    def optimize_model(self, state, actions, rewards, next_state, gamma):
        # Convert lists to tensors
        actions = torch.tensor(actions, dtype=torch.long)
        rewards = torch.tensor(rewards, dtype=torch.float)
        next_state = next_state.float()
        state = state.float()

        # Get current Q values -> keep only the Q values of the actions done
        # (the output has 3 possible actions for each person: index with the action taken by each person)
        state_action_values = self.policy_net(state)[actions]

        # Get next Q values
        with torch.no_grad():
            next_state_values = self.target_net(next_state)
        next_state_values = next_state_values.view(-1, 3)  # Reshape to (num_people, 3)
        next_actions = next_state_values.max(1)[1]  # Greedy (max-Q) action for each person in the next state
        next_state_values = next_state_values.gather(1, next_actions.unsqueeze(-1)).squeeze(-1).detach()

        # Compute the expected Q values
        expected_state_action_values = (next_state_values * gamma) + rewards
        # Compute loss
        loss = nn.functional.smooth_l1_loss(state_action_values, expected_state_action_values)

        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_norm=1)
        self.optimizer.step()
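
For context, this is roughly how I call it (a simplified sketch; `env`, `agent`, `select_actions`, `num_steps`, and `GAMMA` are placeholders for my actual names, and the target network sync is omitted):

    # Simplified training loop (placeholder names); one optimization call per transition
    state = env.reset()
    for step in range(num_steps):
        actions = agent.select_actions(state)        # one action index per person
        next_state, rewards = env.step(actions)      # one reward per person
        agent.optimize_model(state, actions, rewards, next_state, GAMMA)
        state = next_state
        # target_net is copied from policy_net periodically (not shown here)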

        

As noted in the comments, policy_net takes a state as input and outputs Q values for the actions: 3 per person, where the number of people is given by a variable.
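To make the shapes concrete, the network looks roughly like this (a minimal sketch, not my exact architecture; `state_dim`, `num_people`, and the hidden size are placeholders):

    import torch.nn as nn

    class PolicyNet(nn.Module):
        def __init__(self, state_dim, num_people, num_actions=3):
            super().__init__()
            # Flat output of num_people * 3 values, one Q value per (person, action) pair
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128),
                nn.ReLU(),
                nn.Linear(128, num_people * num_actions),
            )

        def forward(self, state):
            # The flat output can be viewed as (num_people, 3) to get per-person Q values
            return self.net(state)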
Are there conceptual mistakes in there?
Thank you very much