I’m training a bot to play a specific game. Some actions are illegal in particular states. For now I ignore illegal moves when choosing an action: I take the action with the highest Q-value among the network's outputs, and if it is illegal I fall back to the second highest (or the third, if the second is illegal as well).
I was worried about how sound and efficient this method is, so I looked at what other people do. I came across this article: https://email@example.com/part-4-neural-network-q-learning-a-tic-tac-toe-player-that-learns-kind-of-2090ca4798d
The author does the same thing and justifies it by saying: "We will be ignoring the fact that for a particular board state some positions would already be taken and no longer be an option. The player will deal with this when choosing a move and ignore illegal moves no matter what their Q values are. That is, we do not try to teach the Neural Network what moves are legal or not. Again, general advice you find is that this is the better approach."
My relevant code currently looks like this:

    # Exploit
    else:
        with torch.no_grad():
            tensor_from_net = policy_net(state).to(self.device)
            while True:
                max_index = tensor_from_net.argmax()
                # If the model outputs an illegal move, punish that action
                # and make it select an action again.
                if max_index.item() not in available_actions:
                    tensor_from_net[max_index] = torch.tensor(-100)
                else:
                    break
            return max_index.unsqueeze_(0)

The “punishment” in the while loop is only temporary, though: it overwrites the output tensor for this one selection and does not affect the network's weights.
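For reference, the same selection can also be written without the loop by masking the illegal entries before taking the argmax. This is only a minimal sketch under a few assumptions: the function name `select_greedy_legal_action` is made up for illustration, the Q-values are a 1-D tensor with one entry per action, and `available_actions` contains integer action indices. Using negative infinity instead of -100 avoids the case where a legal action happens to have a Q-value below the penalty.

```python
import torch


def select_greedy_legal_action(q_values, available_actions):
    """Return the index of the legal action with the highest Q-value.

    q_values: 1-D tensor with one Q-value per action (assumed shape).
    available_actions: iterable of legal action indices (assumed to be ints).
    """
    legal = list(available_actions)
    if not legal:
        raise ValueError("no legal actions available in this state")

    # Additive mask: 0 for legal actions, -inf for illegal ones.
    mask = torch.full_like(q_values, float("-inf"))
    mask[legal] = 0.0

    # q_values itself is left untouched; only the masked copy is used.
    masked_q = q_values + mask
    return int(masked_q.argmax().item())


if __name__ == "__main__":
    with torch.no_grad():
        q = torch.tensor([0.9, 0.1, 0.5, -0.3])
        # Action 0 has the highest Q-value but is treated as illegal here.
        print(select_greedy_legal_action(q, [1, 2, 3]))
```

Like the loop, this only changes which action is picked; nothing is written back to the network.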
- Which approach is better: ignoring the illegal moves or punishing them?
- If ignoring the illegal moves is better, is my approach okay, and what can be done to improve it? If you think punishing the illegal moves is better, I’d appreciate a hint on how to implement it.