Hello, I’m trying to train an AI to play Pong. I have two separate neural networks, one controlling each paddle (left AI + right AI). The way I implemented my training environment is that at every time step I obtain the gradient for the action taken (up, down, stop) given the states (paddle + ball states) at that time step, and I accumulate these gradients for later updates. Below are the methods I use to sample the policy network given the state and an action mask, and the method that computes the loss value.
import math
import torch
from torch.distributions import Categorical

def lossvalue(self, logits, target):
    Pred = [logits]                                        # wrap the logits tensor in a list
    Pred = torch.stack(Pred, dim=0)                        # stack into a batch of size 1: (3,) -> (1, 3)
    Loss = self.LossFn(Pred, torch.LongTensor([target]))   # cross-entropy against the sampled action
    return Loss
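Since I'm using cross entropy loss, my understanding is that lossvalue evaluated on the sampled action is just the negative log-probability of that action under the current policy. A quick standalone check of that equivalence (assuming self.LossFn is nn.CrossEntropyLoss() with default settings; the numbers are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([0.2, -1.3, 0.7])   # raw policy outputs for (stop, up, down)
action = 2                                # suppose "down" was the sampled action
ce = nn.CrossEntropyLoss()(logits.unsqueeze(0), torch.LongTensor([action]))
neg_log_prob = -F.log_softmax(logits, dim=0)[action]
print(ce.item(), neg_log_prob.item())     # both print the same value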
def samplingaction(self, raw_output_tensor, actionmask):
    if self.training:  # training: sample stochastically from the masked policy
        copy_logits = raw_output_tensor.clone()                        # copy so the original logits are left untouched
        copy_logits = copy_logits.masked_fill_(actionmask, -math.inf)  # mask invalid actions
        distribution = Categorical(logits=copy_logits)                 # probability distribution over actions
        action = distribution.sample()                                 # sample an action
    else:  # evaluating: act greedily
        raw_output_tensor = raw_output_tensor.masked_fill_(actionmask, -math.inf)  # mask invalid actions
        action = torch.argmax(raw_output_tensor)                       # choose the best (highest-logit) action
    return action
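The masking works because a logit of -inf becomes a probability of exactly 0 after the softmax inside Categorical, so an invalid action can never be sampled. A tiny standalone sketch of that (made-up logits and mask, not my game code):

import math
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.5, 1.2, -0.3])            # (stop, up, down)
mask = torch.tensor([False, True, False])          # pretend "up" is invalid
masked = logits.clone().masked_fill_(mask, -math.inf)
dist = Categorical(logits=masked)
print(dist.probs)                                  # the masked action has probability 0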
I then use these methods in the game loop like this, collecting the gradient at each time step for later updates:
#get game states
left_states = GetGameStates()
right_states = GetGameStates(False)
#forward propagation + remember numpy -> tensor conversion
left_logit = left_AI.forward(torch.FloatTensor(left_states))
right_logit = right_AI.forward(torch.FloatTensor(right_states))
#obtain left action mask
left_actionmask = torch.tensor([0,0,0],dtype=torch.bool)
if left_y <= 0:  # up is an invalid action
    left_actionmask[1] = 1
elif left_y + pad_length >= height:  # down is an invalid action
    left_actionmask[2] = 1
#obtain right action mask
right_actionmask = torch.tensor([0,0,0],dtype=torch.bool)
if right_y <= 0:  # up is an invalid action
    right_actionmask[1] = 1
elif right_y + pad_length >= height:  # down is an invalid action
    right_actionmask[2] = 1
#sample action + remember tensor -> scalar conversion
left_action = left_AI.samplingaction(left_logit,left_actionmask).item()
right_action = right_AI.samplingaction(right_logit,right_actionmask).item()
#obtain loss values
LeftLoss = left_AI.lossvalue(left_logit,left_action)
RightLoss = right_AI.lossvalue(right_logit,right_action)
#obtain and accumulate gradient for decision
LeftLoss.backward()
RightLoss.backward()
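The accumulation itself relies on PyTorch's default behaviour that repeated .backward() calls sum into each parameter's .grad until it is zeroed. A throwaway example of that on a dummy linear layer (not my actual network):

import torch
import torch.nn as nn

layer = nn.Linear(4, 3)              # stand-in for a policy network
for _ in range(3):
    out = layer(torch.randn(4))
    out.sum().backward()             # each call adds into layer.weight.grad
print(layer.weight.grad)             # the sum of the three per-step gradients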
I’m using cross entropy loss and the Adam optimizer. The idea was that every time a paddle successfully hit the ball, I would perform gradient descent to encourage the actions that led up to the hit (using the gradients accumulated over those time steps). If the paddle missed, I would perform gradient ascent to discourage the actions that led up to the miss (again using the accumulated gradients). After every update I would zero the gradients, and the process of accumulating gradients at each time step would start again. My learn method is below; I call it any time a paddle hits or misses the ball. The reset argument is just a flag for certain situations where I want to zero the gradients without performing any update.
def learn(self, reward, reset):
    # create the optimizer object
    if reward:
        optim = torch.optim.Adam(self.parameters(), lr=0.002)                 # gradient descent: encourage the actions
    else:
        optim = torch.optim.Adam(self.parameters(), lr=0.002, maximize=True)  # gradient ascent: discourage the actions
    if not reset:
        optim.step()      # update the network
    optim.zero_grad()     # clear the accumulated gradients
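For reference, this is roughly how I call learn from the game loop when a paddle hits or misses (the event flags here are placeholders, my real collision checks look different):

# placeholder event flags, not my actual collision-check variables
if left_hit:
    left_AI.learn(reward=True, reset=False)    # gradient descent: encourage the accumulated actions
elif left_missed:
    left_AI.learn(reward=False, reset=False)   # gradient ascent: discourage the accumulated actions
elif round_restarted:
    left_AI.learn(reward=True, reset=True)     # just zero the gradients, no update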
The problem is that when I train the paddles, they’re not learning. Sometimes a paddle will just keep moving up or down constantly and get stuck in this bad state. I did see some learning earlier with a previous model; the paddle wasn’t great, but it was decent. But I changed my states and restarted training, and I can’t seem to get the AI back to that level.
Any suggestions would be appreciated, thank you!