The first code snippet is my implementation, based on what I (think I) understood from the PyTorch DQN tutorial. I wanted to implement the Deep Q-learning algorithm without using frames as input, unlike the example in the docs.

This is what I did:

```
non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.bool)
# Stacking gives shape [number of non-final states, 8] (at most BATCH_SIZE rows)
non_final_next_states = torch.stack([s for s in batch.next_state if s is not None])
state_batch = torch.stack(batch.state)
reward_batch = torch.stack(batch.reward)
action_batch = torch.cat(batch.action)
state_action_values = self.DQN(state_batch).gather(1, action_batch)
next_state_values = torch.zeros(self.BUFFER_SIZE)
next_state_values[non_final_mask] = self.DQN(non_final_next_states).max(1)[0]
expected_state_action_values = (next_state_values * self.GAMMA) + reward_batch.reshape(-1)
loss = self.DQN.loss(state_action_values, expected_state_action_values.unsqueeze(1))
self.DQN.optimizer.zero_grad()
loss.backward()
self.DQN.optimizer.step()
```

With the above code, my average score does not go above -50. (A small note: in this version I stored the transitions as torch tensors.)
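To make the target computation from the first snippet easier to inspect in isolation, here is a stripped-down sketch of how I understand it. Everything in it is made up for illustration: the sizes, the toy rewards, and the single `nn.Linear` standing in for `self.DQN`. The `torch.no_grad()` block is how I remember the tutorial keeping gradients out of the target; my snippet above does not do this, and I am not sure whether that matters here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, not the values from my actual agent
BATCH_SIZE, STATE_DIM, N_ACTIONS, GAMMA = 4, 8, 4, 0.99

net = nn.Linear(STATE_DIM, N_ACTIONS)  # stand-in for self.DQN

# Toy batch: None marks a terminal transition (no next state)
next_states = [torch.randn(STATE_DIM), None, torch.randn(STATE_DIM), None]
rewards = torch.tensor([1.0, -100.0, 0.5, 100.0])

non_final_mask = torch.tensor([s is not None for s in next_states], dtype=torch.bool)
# Only the non-terminal next states are stacked: shape [2, STATE_DIM] here
non_final_next_states = torch.stack([s for s in next_states if s is not None])

# Terminal transitions keep a next-state value of 0
next_state_values = torch.zeros(BATCH_SIZE)
with torch.no_grad():  # assumption: the target should not carry gradients
    next_state_values[non_final_mask] = net(non_final_next_states).max(1)[0]

# TD target: r + GAMMA * max_a Q(s', a), which reduces to r for terminal transitions
expected_state_action_values = rewards + GAMMA * next_state_values
```

In this toy batch the targets for the two terminal transitions are just their rewards, and the target tensor has one entry per sampled transition.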

However, with this second implementation:

```
self.DQN.optimizer.zero_grad()
# Extract states, actions and rewards from the sampled batch
state_batch = torch.tensor(batch.state, dtype=torch.float32)
action_batch = torch.tensor(batch.action, dtype=torch.int64)
reward_batch = torch.tensor(batch.reward, dtype=torch.float32)
non_terminal_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.bool)
non_terminal_state = torch.tensor([s for s in batch.next_state if s is not None], dtype=torch.float32)
state_action_values = self.DQN(state_batch)
target_state_action_values = state_action_values.clone()
next_state_action_values = torch.zeros(self.BUFFER_SIZE)
next_state_action_values[non_terminal_mask] = self.DQN(non_terminal_state).max(1)[0]
batch_index = torch.arange(0, self.BUFFER_SIZE, dtype=torch.int64)
target_state_action_values[batch_index, action_batch] = reward_batch + next_state_action_values * self.GAMMA
loss = self.DQN.loss(target_state_action_values, state_action_values)
loss.backward()
self.DQN.optimizer.step()
```

With the above code, my score does not seem to go above 30. (A small note: here I stored the transitions as NumPy arrays, directly from gym.)
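As far as I can tell, the two snippets differ mainly in how the loss is formed: the first gathers only the Q-value of the chosen action, while the second compares the full Q-value matrix against a copy in which only the chosen action's entry is overwritten. The toy check below uses random tensors as stand-ins for the network output and the targets (none of this is my real data, and the sizes are hypothetical). Since `nn.MSELoss` averages over all elements by default, the second form averages over `BATCH_SIZE * N_ACTIONS` entries, most of which are zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

BATCH_SIZE, N_ACTIONS = 4, 3  # hypothetical toy sizes

q_values = torch.randn(BATCH_SIZE, N_ACTIONS)         # stand-in for self.DQN(state_batch)
actions = torch.randint(0, N_ACTIONS, (BATCH_SIZE,))  # stand-in for action_batch
td_targets = torch.randn(BATCH_SIZE)                  # stand-in for reward + GAMMA * max Q(s', a)

mse = nn.MSELoss()

# Form 1: pick out Q(s, a) for the taken action, shape [BATCH_SIZE, 1]
chosen_q = q_values.gather(1, actions.unsqueeze(1))
loss_gather = mse(chosen_q, td_targets.unsqueeze(1))

# Form 2: copy all Q-values and overwrite only the taken action's entry
full_targets = q_values.clone()
full_targets[torch.arange(BATCH_SIZE), actions] = td_targets
loss_full = mse(full_targets, q_values)

# Same squared errors, but averaged over N_ACTIONS times as many elements
ratio = (loss_gather / loss_full).item()
```

So on the same batch the two losses should differ only by a constant factor of `N_ACTIONS`, which with Adam I would not expect to explain a large gap in scores on its own.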

I am concerned that there is something wrong with my model creation or my learning pipeline. Any help would be appreciated.

Below is my model class.

```
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class DQN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(DQN, self).__init__()
        # Fully connected network: input -> 3 hidden layers -> one Q-value per action
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.hidden1 = nn.Linear(hidden_size, hidden_size)
        self.hidden2 = nn.Linear(hidden_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, output_size)
        self.loss = nn.MSELoss()
        # Adam with its default learning rate
        self.optimizer = optim.Adam(self.parameters())

    def forward(self, state):
        x = F.relu(self.input_layer(state))
        x = F.relu(self.hidden1(x))
        x = F.relu(self.hidden2(x))
        # No activation on the output: raw Q-values
        actions = self.output_layer(x)
        return actions
```