Am I training my model the right way?

The first code snippet is my implementation, which I (think?) understood from the PyTorch docs. I wanted to implement the deep Q-learning algorithm on a plain state vector, without using frames (image input) the way the tutorial in the docs does.
This is what I did:

# Mask of transitions whose next_state is not terminal (None)
non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.bool)

# Stacking converts to shape [BATCH_SIZE, 8]
non_final_next_states = torch.stack([s for s in batch.next_state if s is not None])

state_batch = torch.stack(batch.state)
reward_batch = torch.stack(batch.reward)

action_batch = torch.cat(batch.action)

# Q(s, a) for the actions that were actually taken
state_action_values = self.DQN(state_batch).gather(1, action_batch)

# max_a' Q(s', a') for non-terminal next states, 0 for terminal ones
next_state_values = torch.zeros(self.BUFFER_SIZE)
next_state_values[non_final_mask] = self.DQN(non_final_next_states).max(1)[0]

# TD target: r + gamma * max_a' Q(s', a')
expected_state_action_values = (next_state_values * self.GAMMA) + reward_batch.reshape(-1)

loss = self.DQN.loss(state_action_values, expected_state_action_values.unsqueeze(1))

self.DQN.optimizer.zero_grad()
loss.backward()
self.DQN.optimizer.step()

With the above code my average score does not go above -50. (Just a small note: here I stored the transitions as torch tensors.)
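For reference, the corresponding step in the tutorial I adapted this from looks roughly like the sketch below; the tutorial keeps a separate target_net and detaches the next-state values, whereas my code above uses self.DQN for everything (target_net and BATCH_SIZE are the tutorial's names, not mine).

# Roughly the tutorial's version of the bootstrap step; target_net is a frozen
# copy of the online network, and the max is detached from the graph
next_state_values = torch.zeros(BATCH_SIZE)
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
expected_state_action_values = (next_state_values * GAMMA) + reward_batch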

However with this implementation

self.DQN.optimizer.zero_grad()

# Extract states, actions and rewards from the sampled batch
state_batch = torch.tensor(batch.state, dtype=torch.float32)
action_batch = torch.tensor(batch.action, dtype=torch.int64)
reward_batch = torch.tensor(batch.reward, dtype=torch.float32)

# Mask of transitions whose next_state is not terminal (None)
non_terminal_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.bool)
non_terminal_state = torch.tensor([s for s in batch.next_state if s is not None], dtype=torch.float32)

# Q(s, a) for all actions; the target starts as a copy, and only the entries
# for the taken actions are overwritten with the TD target below
state_action_values = self.DQN(state_batch)
target_state_action_values = state_action_values.clone()

# max_a' Q(s', a') for non-terminal next states, 0 for terminal ones
next_state_action_values = torch.zeros(self.BUFFER_SIZE)
next_state_action_values[non_terminal_mask] = self.DQN(non_terminal_state).max(1)[0]

batch_index = torch.arange(0, self.BUFFER_SIZE, dtype=torch.int64)
target_state_action_values[batch_index, action_batch] = reward_batch + next_state_action_values * self.GAMMA

loss = self.DQN.loss(target_state_action_values, state_action_values)
loss.backward()
self.DQN.optimizer.step()

With the above code my score does not seem to go above 30. (Just a small note: here I stored the transitions as NumPy arrays, straight from gym.)
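In both versions the batch comes from the same sampling step. Roughly, I follow the Transition namedtuple pattern from the tutorial, something like the sketch below (self.memory is my replay buffer, and I sample self.BUFFER_SIZE transitions at a time, which is why the zeros tensors above use that size):

from collections import namedtuple
import random

Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

# rough sketch of how I sample and unpack the buffer
transitions = random.sample(self.memory, self.BUFFER_SIZE)
batch = Transition(*zip(*transitions))  # batch.state is a tuple of states, etc.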

I am concerned that there is something wrong with my model creation and learning pipeline. Any help would be appreciated.
Below is my model class.

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class DQN(nn.Module):

    def __init__(self, input_size, hidden_size, output_size):
        super(DQN, self).__init__()
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.hidden1 = nn.Linear(hidden_size, hidden_size)
        self.hidden2 = nn.Linear(hidden_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, output_size)
        self.loss = nn.MSELoss()
        self.optimizer = optim.Adam(self.parameters())

    def forward(self, state):
        x = F.relu(self.input_layer(state))
        x = F.relu(self.hidden1(x))
        x = F.relu(self.hidden2(x))
        actions = self.output_layer(x)

        return actions
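For completeness, I construct the network roughly like this; the state vector has 8 features (hence the [BATCH_SIZE, 8] comment earlier), while the hidden size and action count below are just placeholders:

# placeholder instantiation; hidden_size=128 and n_actions=4 are illustrative only
n_observations = 8   # length of the state vector
n_actions = 4        # number of discrete actions in my environment (placeholder)
self.DQN = DQN(input_size=n_observations, hidden_size=128, output_size=n_actions)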

Please check your code very carefully; most of the time the problem is that the data has been processed wrongly, e.g. mixing up state and next_state, or missing something important.

Besides, there are many DQN variations, including a plain single Q-net, fixed-target DQN, double DQN, dueling DQN, and even Rainbow. So only you can really fix the problem yourself, as it is hard to understand how you store and process state, next_state, action, reward and terminal in your undocumented code snippets; their shapes and data structures are not clear.
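For example, the fixed-target variant keeps a frozen copy of the online network and bootstraps only from that copy; a minimal sketch (policy_net, target_net and TARGET_UPDATE are placeholder names, not from your code):

# minimal sketch of fixed-target DQN; all names here are placeholders
target_net = DQN(input_size, hidden_size, output_size)
target_net.load_state_dict(policy_net.state_dict())  # start as an exact copy
target_net.eval()

# in the update step: bootstrap from the frozen copy, not the online network
with torch.no_grad():
    next_q = target_net(non_final_next_states).max(1)[0]

# every TARGET_UPDATE steps, refresh the frozen copy
if step % TARGET_UPDATE == 0:
    target_net.load_state_dict(policy_net.state_dict())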

I apologize for the nature of my post. In hindsight I realize it wasn't well documented and doesn't really convey what I am trying to achieve. Since posting I have decided to learn more about reward modelling and agent construction, and hence haven't moved forward in any way.
Thank you for your input, I'll certainly keep it in mind for my future posts :smile:

It is OK, don't be shy about it. If you could provide some shape hints like this:

keys = self.linear_keys(input)  # shape: (N, T, key_size)
query = self.linear_query(input)  # shape: (N, T, key_size)
values = self.linear_values(input)  # shape: (N, T, value_size)

it would make your code much easier to read and check. :smile:
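For instance, in your first snippet the hints might look something like this (the question marks are exactly the things we cannot tell from the post):

state_batch = torch.stack(batch.state)    # shape: (BATCH_SIZE, 8)?
action_batch = torch.cat(batch.action)    # shape: (BATCH_SIZE, 1)?  gather(1, ...) needs (N, 1)
reward_batch = torch.stack(batch.reward)  # shape: (BATCH_SIZE,)? or (BATCH_SIZE, 1)?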
