Deep Q-network not learning, and optimizer step not moving output towards the target

Hi, I am trying to create a simple deep Q-network for RL with Conv2d layers. I can’t figure out what I am doing wrong. The only thing I can see that doesn’t seem right is that when I get the model prediction for a state after the optimizer step, it doesn’t seem to get closer to the target.

I am using pixels from Pong in OpenAI Gym, preprocessed into single-channel 90x90 images, with a batch size of 32 and a replay memory.
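
For context, the frames are preprocessed roughly like this (simplified sketch; the exact crop and scaling values here are illustrative, not my exact code):

    import cv2
    import numpy as np
    import torch

    def preprocess(frame):
        # frame: raw 210x160x3 RGB array from Gym's Pong
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # drop colour channels
        gray = gray[34:194, :]                           # crop scoreboard/border (illustrative values)
        resized = cv2.resize(gray, (90, 90))             # down to 90x90
        scaled = resized.astype(np.float32) / 255.0      # scale to [0, 1]
        return torch.from_numpy(scaled).unsqueeze(0)     # shape (1, 90, 90): single channel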

As an example, if I use a batch size of one and run self(states) again right after the optimizer step, the output is as follows:

    current_q_values -> -0.16351485  0.29163417  0.11192469 -0.08969332  0.11081569  0.37215832
    q_target         -> -0.16351485  0.5336551   0.11192469 -0.08969332  0.11081569  0.37215832
    self(states)     -> -0.8427617   0.6415581   0.44988257 -0.43897176  0.8693738   0.40007943
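
To be clear about what I’m checking, the single-sample version is roughly this (sketch; states, next_states, action, reward and done are one transition pulled from memory, and the names mirror the learning block further down):

    # single transition: print prediction and target, then the prediction again after one step
    current_q_values = self(states)                  # shape (1, action_space)
    next_q_values = self(next_states)
    q_target = current_q_values.clone()
    q_target[0, action] = reward + self.gamma * next_q_values.max() * (1 - done)

    print(current_q_values)   # "current_q_values" row above
    print(q_target)           # "q_target" row above

    self.optimizer.zero_grad()
    loss = fn.smooth_l1_loss(current_q_values, q_target)
    loss.backward()
    self.optimizer.step()

    print(self(states))       # "self(states)" row above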

Does this look like what would be expected after a single optimizer step?

The network with loss and optimizer:

    # 90x90 single-channel input; stride-1, no-padding convolutions
    self.in_layer = Conv2d(channels, 32, 8)        # 90x90 -> 83x83
    self.hidden_conv_1 = Conv2d(32, 64, 4)         # 83x83 -> 80x80
    self.hidden_conv_2 = Conv2d(64, 128, 3)        # 80x80 -> 78x78
    self.hidden_fc1 = Linear(128 * 78 * 78, 64)
    self.hidden_fc2 = Linear(64, 32)
    self.output = Linear(32, action_space)         # one Q-value per action

    self.loss = torch.nn.MSELoss()                 # defined here, but smooth_l1_loss is used below
    self.optimizer = torch.optim.Adam(
        self.parameters(), lr=learning_rate)       # lr is 0.001

    def forward(self, state):
        in_out = fn.relu(self.in_layer(state))
        in_out = fn.relu(self.hidden_conv_1(in_out))
        in_out = fn.relu(self.hidden_conv_2(in_out))
        in_out = in_out.view(-1, 128 * 78 * 78)     # flatten the conv output
        in_out = fn.relu(self.hidden_fc1(in_out))
        in_out = fn.relu(self.hidden_fc2(in_out))
        return self.output(in_out)                  # raw Q-values, no activation on the output
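
For reference, the 128 * 78 * 78 comes from three stride-1, no-padding convolutions on the 90x90 input: 90 - 8 + 1 = 83, then 83 - 4 + 1 = 80, then 80 - 3 + 1 = 78. A quick shape check (sketch, where net is an instance of the network above with channels = 1):

    dummy = torch.zeros(1, 1, 90, 90)      # (batch, channels, height, width)
    out = net.in_layer(dummy)              # -> (1, 32, 83, 83)
    out = net.hidden_conv_1(out)           # -> (1, 64, 80, 80)
    out = net.hidden_conv_2(out)           # -> (1, 128, 78, 78)
    print(out.shape)                       # torch.Size([1, 128, 78, 78])
    print(net(dummy).shape)                # torch.Size([1, action_space])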

Then the learning block:

    self.optimizer.zero_grad()

    # sample a minibatch of (state, action, reward, next_state, done) transitions
    sample = self.sample(self.batch_size)
    states = torch.stack([i[0] for i in sample])
    actions = torch.tensor([i[1] for i in sample], device=device)
    rewards = torch.tensor([i[2] for i in sample], dtype=torch.float32, device=device)
    next_states = torch.stack([i[3] for i in sample])
    dones = torch.tensor([i[4] for i in sample], dtype=torch.uint8, device=device)

    current_q_vals = self(states)
    next_q_vals = self(next_states)

    # target matches the current prediction everywhere except the taken action, which is
    # set to reward + gamma * max next Q; the (~dones) factor is meant to drop the
    # bootstrap term for terminal transitions
    q_target = current_q_vals.clone()
    q_target[torch.arange(states.size()[0]), actions] = rewards + (self.gamma * next_q_vals.max(dim=1)[0]) * (~dones).float()

    loss = fn.smooth_l1_loss(current_q_vals, q_target)
    loss.backward()

    self.optimizer.step()
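
For completeness, the replay memory is just a deque of (state, action, reward, next_state, done) tuples, and self.sample draws a uniform random minibatch, roughly like this (simplified sketch of my buffer code; the maxlen is illustrative):

    import random
    from collections import deque

    # in __init__: each entry is one transition tuple
    self.memory = deque(maxlen=100000)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch of stored transitions
        return random.sample(self.memory, batch_size)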

Thanks.