I have been trying to implement Actor-Critic with a convolutional neural network. The state for the reinforcement learning agent consists of two different images. After (random) initialisation, the output of the CNN (the actions for the Actor) is almost the same for different inputs. Because of this, the agent never learns anything useful, even after training.

State definitions (2 inputs):

Input 1: [1,1,90,128] image with a maximum pixel value of 45.

Input 2: [1,1,45,80] image with a maximum pixel value of 45.

Expected output of the actor: [x,y], a 2-dimensional vector that depends on the state. Here x is expected in the range [0,160] and y in the range [0,112].
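To make the target range concrete, here is a minimal sketch (not part of my actual code) of how an unbounded network output could be squashed into that range. The helper name `scale_to_range` is hypothetical, and it uses NumPy instead of PyTorch for brevity. For comparison, a plain `max_action * tanh(...)` head spans [-max_action, max_action] rather than [0, max_action].

```python
import numpy as np

def scale_to_range(raw, max_action):
    """Map unbounded values into [0, max_action] per dimension (hypothetical helper)."""
    raw = np.asarray(raw, dtype=np.float64)
    max_action = np.asarray(max_action, dtype=np.float64)
    # tanh lies in [-1, 1]; shift and scale it to [0, 1], then stretch to the range.
    return 0.5 * (np.tanh(raw) + 1.0) * max_action

max_action = [160.0, 112.0]
print(scale_to_range([0.0, 0.0], max_action))     # centre of the range
print(scale_to_range([-10.0, 10.0], max_action))  # close to the lower / upper bound
```

Whether this rescaling is appropriate depends on how the critic and the environment interpret the action, so treat it as an illustration of the intended range only.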

Different modifications tried on the input:

1: Feed both images as they are.

2: Feed both images normalised as `(img/45)`, so that pixel values are in [0,1].

3: Feed both images normalised as `2*((img/45)-0.5)`, so that pixel values are in [-1,1].

4: Feed both images normalised as `(img-mean)/std`.

Result: The output of the CNN remains almost the same.
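For clarity, the four input variants can be sketched as follows. This is an illustrative NumPy version, not my exact preprocessing code: the function name `normalise`, the mode names, the use of per-image mean and std, and the synthetic sparse test image are all assumptions made for the example.

```python
import numpy as np

MAX_PIXEL = 45.0  # maximum pixel value of both state images

def normalise(img, mode):
    """Return the image under one of the four input variants (hypothetical helper)."""
    img = np.asarray(img, dtype=np.float32)
    if mode == "raw":          # variant 1: unchanged
        return img
    if mode == "unit":         # variant 2: values in [0, 1]
        return img / MAX_PIXEL
    if mode == "symmetric":    # variant 3: values in [-1, 1]
        return 2.0 * (img / MAX_PIXEL - 0.5)
    if mode == "standardise":  # variant 4: zero mean, unit variance
        std = img.std()
        # Assumption: per-image statistics, with a guard for constant images.
        return (img - img.mean()) / (std if std > 0 else 1.0)
    raise ValueError(f"unknown mode: {mode}")

# Synthetic sparse image standing in for a real state (mostly zeros).
rng = np.random.default_rng(0)
img = np.zeros((90, 128), dtype=np.float32)
img[rng.integers(0, 90, 50), rng.integers(0, 128, 50)] = MAX_PIXEL

for mode in ("raw", "unit", "symmetric", "standardise"):
    out = normalise(img, mode)
    print(mode, float(out.min()), float(out.max()))
```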

The code defining the actor is given below.

```
import numpy as np
import pandas as pd
from tqdm import tqdm
import time
import cv2
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Actor(nn.Module):
    def __init__(self, action_dim, max_action):
        super(Actor,self).__init__()
        # state image [1,1,90,128]
        self.conv11 = nn.Conv2d(1,16,5)
        self.conv11_bn = nn.BatchNorm2d(16)
        self.conv12 = nn.Conv2d(16,16,5)
        self.conv12_bn = nn.BatchNorm2d(16)
        self.fc11 = nn.Linear(19*29*16,500)
        # dim image [1,1,45,80]
        self.conv21 = nn.Conv2d(1,16,5)
        self.conv21_bn = nn.BatchNorm2d(16)
        self.conv22 = nn.Conv2d(16,16,5)
        self.conv2_bn = nn.BatchNorm2d(16)
        self.fc21 = nn.Linear(8*17*16,250)
        # common pool
        self.pool = nn.MaxPool2d(2,2)
        # after concatenation
        self.fc2 = nn.Linear(750,100)
        self.fc3 = nn.Linear(100,10)
        self.fc4 = nn.Linear(10,action_dim)
        self.max_action = max_action

    def forward(self,x,y):
        # state image
        x = self.conv11_bn(self.pool(F.relu(self.conv11(x))))
        x = self.conv11_bn(self.pool(F.relu(self.conv12(x))))
        x = x.view(-1,19*29*16)
        x = F.relu(self.fc11(x))
        # state dim
        y = self.conv11_bn(self.pool(F.relu(self.conv21(y))))
        y = self.conv11_bn(self.pool(F.relu(self.conv22(y))))
        y = y.view(-1,8*17*16)
        y = F.relu(self.fc21(y))
        # concatenate
        z = torch.cat((x,y),dim=1)
        z = F.relu(self.fc2(z))
        z = F.relu(self.fc3(z))
        z = self.max_action*torch.tanh(self.fc4(z))
        return z


# to read different sample states for testing
obs = []
for i in range(200):
    obs.append(np.load('eval_episodes/obs_'+str(i)+'.npy',allow_pickle=True))
obs = np.array(obs)


def tensor_from_numpy(state):
    # to add dimensions to tensor to make it [batch_size,channels,height,width]
    state_img = state
    state_img = torch.from_numpy(state_img).float()
    state_img = state_img[np.newaxis, :]
    state_img = state_img[np.newaxis, :].to(device)
    return state_img


actor = Actor(2,torch.FloatTensor([160,112]))
for i in range(20):
    a = tensor_from_numpy(obs[i][0])
    b = tensor_from_numpy(obs[i][2])
    print(actor(a,b))
```

Output of the above code:

```
tensor([[28.8616, 3.0934]], grad_fn=<MulBackward0>)
tensor([[27.4125, 3.2864]], grad_fn=<MulBackward0>)
tensor([[28.2210, 2.6859]], grad_fn=<MulBackward0>)
tensor([[27.6312, 3.9528]], grad_fn=<MulBackward0>)
tensor([[25.9290, 4.2942]], grad_fn=<MulBackward0>)
tensor([[26.9652, 4.5730]], grad_fn=<MulBackward0>)
tensor([[27.1342, 2.9612]], grad_fn=<MulBackward0>)
tensor([[27.6494, 4.2218]], grad_fn=<MulBackward0>)
tensor([[27.3122, 1.9945]], grad_fn=<MulBackward0>)
tensor([[29.6915, 1.9938]], grad_fn=<MulBackward0>)
tensor([[28.2001, 2.5967]], grad_fn=<MulBackward0>)
tensor([[26.8502, 4.4917]], grad_fn=<MulBackward0>)
tensor([[28.6489, 3.2022]], grad_fn=<MulBackward0>)
tensor([[28.1455, 2.7610]], grad_fn=<MulBackward0>)
tensor([[27.2369, 3.4243]], grad_fn=<MulBackward0>)
tensor([[25.9513, 5.3057]], grad_fn=<MulBackward0>)
tensor([[28.1400, 3.3242]], grad_fn=<MulBackward0>)
tensor([[28.2049, 2.6622]], grad_fn=<MulBackward0>)
tensor([[26.7446, 2.5966]], grad_fn=<MulBackward0>)
tensor([[25.3867, 5.0346]], grad_fn=<MulBackward0>)
```

The state (`.npy`) files can be found here

For different states the actions should vary across [0,160] for x and [0,112] for y, but here the outputs vary only slightly.
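To quantify how little the outputs vary, the short sketch below summarises the 20 printed actions. The values are copied from the printout above; the variable names are only illustrative. The spread comes out at roughly 3% of the expected range in each dimension.

```python
import numpy as np

# Actor outputs copied from the printout above (one row per state).
outputs = np.array([
    [28.8616, 3.0934], [27.4125, 3.2864], [28.2210, 2.6859], [27.6312, 3.9528],
    [25.9290, 4.2942], [26.9652, 4.5730], [27.1342, 2.9612], [27.6494, 4.2218],
    [27.3122, 1.9945], [29.6915, 1.9938], [28.2001, 2.5967], [26.8502, 4.4917],
    [28.6489, 3.2022], [28.1455, 2.7610], [27.2369, 3.4243], [25.9513, 5.3057],
    [28.1400, 3.3242], [28.2049, 2.6622], [26.7446, 2.5966], [25.3867, 5.0346],
])
action_range = np.array([160.0, 112.0])  # expected width of x and y

# Spread of the observed actions, per dimension.
spread = outputs.max(axis=0) - outputs.min(axis=0)
print("min per dimension:  ", outputs.min(axis=0))
print("max per dimension:  ", outputs.max(axis=0))
print("spread / full range:", spread / action_range)
```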

Note: the input images are initially sparse (lots of zeros in the image).
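As a rough way to put a number on that sparsity, the fraction of zero pixels can be measured like this. The helper `zero_fraction` is hypothetical, and the image is a synthetic stand-in with the shape of the second input, not one of my actual states.

```python
import numpy as np

def zero_fraction(img):
    """Fraction of pixels that are exactly zero (hypothetical helper)."""
    img = np.asarray(img)
    return float(np.mean(img == 0))

# Synthetic stand-in for a sparse [45, 80] state image.
img = np.zeros((45, 80), dtype=np.float32)
img[10:15, 20:30] = 45.0  # 50 non-zero pixels out of 3600

print(zero_fraction(img))
```

Running the same function on the loaded `.npy` states would show how sparse the real inputs are.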

Is there a problem with the state pixel values or the network definition?