Policy Gradient For Pong Not Learning

Hello, I’m trying to train an AI to play Pong. I have two separate neural networks, one controlling each paddle (a left AI and a right AI). My training environment works like this: at every time step I obtain the gradient for the action taken (up, down, stop) given the states (paddle + ball states) at that time step, and I accumulate these gradients for a later update. Below are the method I use to sample the policy network given a state and an action mask, and the method I use to obtain the loss value.

import math

import torch
from torch.distributions import Categorical

def lossvalue(self, logits, target):
    pred = logits.unsqueeze(0)  # add a batch dimension: (3,) -> (1, 3)
    loss = self.LossFn(pred, torch.LongTensor([target]))  # cross entropy against the taken action
    return loss

def samplingaction(self, raw_output_tensor, actionmask):
    # mask invalid actions by setting their logits to -inf; masked_fill is
    # out-of-place, so the caller's logits tensor is left unchanged
    masked_logits = raw_output_tensor.masked_fill(actionmask, -math.inf)
    if self.training:  # training: sample stochastically from the policy
        distribution = Categorical(logits=masked_logits)
        action = distribution.sample()
    else:  # evaluating: greedily choose the best valid action
        action = torch.argmax(masked_logits)

    return action
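
Since the target passed to lossvalue is the action that was just sampled, and self.LossFn is cross-entropy (mentioned further down), the loss works out to the negative log-probability of the taken action, so each backward() call accumulates -∇log π(a|s). A minimal standalone check of that equivalence, using made-up logits:

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.tensor([1.0, -0.5, 0.3])  # raw policy outputs for 3 actions
action = torch.tensor(1)                 # suppose this action was sampled

ce = F.cross_entropy(logits.unsqueeze(0), action.unsqueeze(0))  # what lossvalue computes
nll = -Categorical(logits=logits).log_prob(action)              # -log pi(action | state)
print(torch.allclose(ce, nll))  # True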

I then use these methods in the game loop as follows, collecting the gradient at each time step for the later update:

            #get game states
            left_states = GetGameStates()
            right_states = GetGameStates(False)
            #forward propagation + remember numpy -> tensor conversion
            left_logit = left_AI(torch.FloatTensor(left_states))
            right_logit = right_AI(torch.FloatTensor(right_states))
            #obtain left action mask
            left_actionmask = torch.tensor([0,0,0],dtype=torch.bool)
            if (left_y <= 0): #up invalid action
                left_actionmask[1] = 1
            elif (left_y + pad_length >= height): #down invalid action
                left_actionmask[2] = 1
            #obtain right action mask
            right_actionmask = torch.tensor([0,0,0],dtype=torch.bool)
            if (right_y <= 0): #up invalid action
                right_actionmask[1] = 1
            elif (right_y + pad_length >= height): #down invalid action
                right_actionmask[2] = 1
            #sample action + remember tensor -> scalar conversion
            left_action = left_AI.samplingaction(left_logit,left_actionmask).item()
            right_action = right_AI.samplingaction(right_logit,right_actionmask).item()
            #obtain loss values
            LeftLoss = left_AI.lossvalue(left_logit,left_action)
            RightLoss = right_AI.lossvalue(right_logit,right_action)
            #obtain and accumulate gradient for decision
            LeftLoss.backward()
            RightLoss.backward()
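
This relies on PyTorch summing gradients into each parameter's .grad across repeated backward() calls until they are explicitly zeroed, which is what accumulates the per-time-step gradients. A tiny standalone illustration of that behaviour:

import torch

w = torch.ones(1, requires_grad=True)
(2 * w).backward()  # contributes dL/dw = 2
(3 * w).backward()  # contributes dL/dw = 3 on top
print(w.grad)       # tensor([5.]) -- gradients accumulate until zeroed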

I’m using cross-entropy loss and the Adam optimizer. The idea is that every time a paddle successfully hits the ball, I perform gradient descent to encourage the actions that led up to the hit (using the gradients accumulated over those time steps). If the paddle misses, I perform gradient ascent to discourage the actions that led up to the miss. Every time I finish an update, I zero the gradients, and the process of accumulating gradients at each time step starts again. My learn method, which I call any time a paddle hits or misses the ball, is below. The reset argument is just a flag for certain situations where I want to zero the gradients without performing an update.

def learn(self, reward, reset):
    # create the optimizer: minimize the loss after a hit, maximize it after a miss
    if reward:
        optim = torch.optim.Adam(self.parameters(), lr=0.002)
    else:
        optim = torch.optim.Adam(self.parameters(), lr=0.002, maximize=True)

    if not reset:
        optim.step()  # apply the accumulated gradients

    optim.zero_grad()  # clear the accumulated gradients
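
For completeness, this is roughly how I call learn from the game loop; the event flags here are simplified stand-ins for my actual collision checks:

# simplified stand-ins for my collision events, just for illustration
left_hit, left_missed, round_reset = True, False, False

if left_hit:
    left_AI.learn(reward=True, reset=False)   # gradient descent: encourage the accumulated actions
elif left_missed:
    left_AI.learn(reward=False, reset=False)  # gradient ascent: discourage the accumulated actions
elif round_reset:
    left_AI.learn(reward=False, reset=True)   # zero the gradients without updating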

The problem is that when I train the paddles, they don’t learn. Sometimes a paddle will just keep moving up or down constantly and get stuck in that bad state. I did see some learning earlier with a previous model: the paddle was not great, but it was decent. However, I changed my state representation and restarted training, and I can’t seem to get the AI back to that level.

Any suggestions would be appreciated, thank you.