I am trying to replicate, in PyTorch, another person’s Deep Q-Network (written in TensorFlow) that learns the OpenAI Gym environment ‘BreakoutDeterministic-v4’. The implementation I am replicating is here: https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb
When I look at the actions chosen for a training batch, the chosen action is the same for every state in the batch, no matter how long the network has trained. The action sometimes changes from one batch to another, but within a batch it is always identical for every state, even though the states are sampled randomly.
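A check like the following can help narrow down the symptom. It is a minimal sketch, not part of the original code: `summarize_batch_q` is a hypothetical helper, and the Q-values are made-up numbers standing in for the network's output on one batch. It separates two cases: the Q-values themselves are (almost) identical for every state, which suggests the output barely depends on the input, or the Q-values differ between states but one action still wins everywhere.

```python
import torch


def summarize_batch_q(q_values: torch.Tensor) -> dict:
    """Summarize a (batch, n_actions) tensor of Q-values (hypothetical helper)."""
    # Greedy action for each state in the batch.
    greedy = q_values.argmax(dim=1)
    # Spread of each action's Q-value across the batch. Values near zero
    # mean the network output barely changes from state to state.
    spread = q_values.std(dim=0, unbiased=False)
    return {
        "greedy_actions": greedy.tolist(),
        "distinct_actions": int(greedy.unique().numel()),
        "max_spread_across_batch": float(spread.max()),
    }


# Made-up Q-values for 3 states and 4 actions, all rows identical.
q = torch.tensor([[0.1, 0.5, 0.2, 0.0],
                  [0.1, 0.5, 0.2, 0.0],
                  [0.1, 0.5, 0.2, 0.0]])
summary = summarize_batch_q(q)
print(summary)
```

In practice the tensor would come from the network, for example `model(states).detach()`, where `model` and `states` are placeholders for the trained network and a sampled batch.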
All hyperparameters are taken directly from the network I am trying to replicate, and the input image is cropped in exactly the same way. I have tested the original network, and it does work.
Below is the code used for the DQN.

    class DQN(nn.Module):
        def __init__(self, input_shape, n_actions):
            super(DQN, self).__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(input_shape, 32, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1),
                nn.ReLU()
            )
            self.val_conv = nn.Sequential(
                nn.Conv2d(64, 512, kernel_size=7, stride=1),
                nn.ReLU()
            )
            self.adv_conv = nn.Sequential(
                nn.Conv2d(64, 512, kernel_size=7, stride=1),
                nn.ReLU()
            )
            conv_out_size = self._get_conv_out(input_shape)
            self.value_stream = nn.Sequential(
                nn.Linear(conv_out_size, 1)
            )
            self.advantage_stream = nn.Sequential(
                nn.Linear(conv_out_size, n_actions)
            )

        def _get_conv_out(self, shape):
            o = self.conv(torch.zeros(1, *shape))
            o = self.val_conv(torch.zeros(*tuple(o.size())))
            return int(np.prod(o.size()))

        def forward(self, state):
            print(state.cpu())
            features = self.conv(state)
            feat_val = self.val_conv(features)
            feat_adv = self.adv_conv(features)
            feat_val = feat_val.view(feat_val.size(0), -1)
            feat_adv = feat_adv.view(feat_adv.size(0), -1)
            values = self.value_stream(feat_val)
            advantages = self.advantage_stream(feat_adv)
            qvals = values + (advantages - advantages.mean())
            return qvals

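One detail that may be worth comparing against the reference: in the usual dueling formulation, the advantage mean is taken over the actions of each state separately, whereas `advantages.mean()` with no arguments reduces over the whole batch to a single scalar. The sketch below uses a hypothetical helper `dueling_q` and made-up numbers to contrast the two. Subtracting a scalar does not change which action is greedy in each row, so this difference may not explain identical actions by itself, but it does change the Q-value magnitudes that feed into the training targets.

```python
import torch


def dueling_q(values: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Combine V(s), shape (batch, 1), with A(s, a), shape (batch, n_actions)."""
    # Mean over the action dimension only, kept as (batch, 1) for broadcasting.
    return values + advantages - advantages.mean(dim=1, keepdim=True)


# Made-up stream outputs for 2 states and 2 actions.
values = torch.tensor([[1.0], [2.0]])
advantages = torch.tensor([[1.0, 3.0], [10.0, 20.0]])

# Per-state mean: each row is centered on its own advantages.
per_state = dueling_q(values, advantages)

# As written in forward() above: one scalar mean over batch and actions.
whole_batch = values + (advantages - advantages.mean())

print(per_state)    # rows: [0, 2] and [-3, 7]
print(whole_batch)  # rows: [-6.5, -4.5] and [3.5, 13.5]
```

With the per-state mean, each state's Q-values depend only on that state. With the batch-wide mean, a state's Q-values shift depending on which other states happen to be in the same batch.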
In case it is relevant: the optimizer is Adam with a learning rate of 0.00001, and epsilon decays at exactly the same rate as in the implementation I am trying to replicate.