This year I got introduced to neural networks and machine learning in Uni, and now I want to create a bot for a game this summer. The game is called Awale, and it is fairly simple. All you have to do is choose one out of six cups every turn, and your opponent does the same. Under some circumstances you win some points, and the player with the most points in the end wins the game.

So I wanted to make a Deep Q-learning agent for this. I have built something like this before for an assignment, but I thought the assignment was a bit oversimplified and I wanted more of a challenge. I have built the architecture for the game, and I am now almost ready to train my agents. The only thing I am unsure about, and I hope you can help me with this, is the following.

Every turn in the game, I have to do a forward pass three times (once for me, once for the opponent, and once again for me), while I only want to calculate the loss for the first forward pass. I understand that inside a `with torch.no_grad():` block I can do forward passes without autograd recording operations in the background, so I had the idea to wrap the second and third forward passes in `torch.no_grad()`. Is this the best way to go here?

The idea sounds really interesting, and it would be great if you could keep us updated about your progress here.

I think that’s the best approach to make sure only the output of the first forward pass has a valid grad_fn to calculate the gradients.
Let us know, if you encounter any problems with this approach.
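To illustrate the point about grad_fn, here is a minimal sketch. The `nn.Linear(12, 6)` model and the random states are placeholders I made up (six outputs for the six cups, the input size is just an assumption), not your actual architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the Q-network: 6 outputs, one Q-value per cup.
# The input size of 12 is an assumption for illustration only.
model = nn.Linear(12, 6)

s_t = torch.rand(1, 12)     # placeholder for the current state
s_next = torch.rand(1, 12)  # placeholder for a later state

# First forward pass: recorded by autograd, so the output has a grad_fn.
q_t = model(s_t)

# Later forward passes: nothing is recorded inside no_grad().
with torch.no_grad():
    q_next = model(s_next)

print(q_t.grad_fn is not None)  # the first output can be backpropagated through
print(q_next.requires_grad)     # the second output is cut off from the graph
```

Only `q_t` is attached to the graph, so calling backward on a loss built from both tensors can only reach the parameters through `q_t`.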

I am still trying to get this absolutely right, so let me rephrase my question, just to be sure that I am doing this correctly.

Model M takes an input s_t and gives an output Q_t. Q_t will be the output with which we want to calculate the loss.
Now we want to create the target vector, and since this is a Q-learning approach, we have to do some more forward passes.

Using Q_t, we can compute a_t, and with a_t, we can compute r_t and s_{t+1} (these are all game mechanics).
We can feed s_{t+1} to the network again and acquire Q_{t+1}. This will be used to acquire a_{t+1}, and that will in turn be used to acquire r_{t+1} and s_{t+2}. Now we are almost there. We feed s_{t+2} to the network and acquire Q_{t+2}. Now we run s_t through the network again to get a ‘copy’ of Q_t (call it Q_t’) and replace Q_t’[a_t] with r_t + theta*max(Q_{t+2}) (theta is a factor that discounts future rewards). We then compute the loss with loss_function(Q_t, Q_t’), where Q_t is the prediction and Q_t’ is the target.

The tricky thing here is that both the output and the target vectors that are fed into the loss function are outputs of the same network. I only want the backward pass to run through the part of the graph that is responsible for producing Q_t.

Am I doing this right by wrapping everything except the first forward pass (where M takes input s_t and generates output Q_t) in `torch.no_grad()`?

I’m not really familiar with RL, so I’m not sure I can see all possible pitfalls in this approach.
A “safe” approach could be to detach the target before calculating the loss.
This would make sure to only backpropagate using the first input parameter to the criterion.
Would that work for your approach?
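A rough sketch of what I mean by detaching the target, under some assumptions: the `nn.Linear(12, 6)` model, the MSE criterion, the random states, the reward of 1.0 and theta = 0.9 are all placeholders, and the game step that would produce s_{t+2} is skipped. Instead of a second forward pass on s_t, the copy Q_t’ is taken as a detached clone of Q_t, which should hold the same values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(12, 6)   # hypothetical Q-network, sizes assumed
criterion = nn.MSELoss()   # assumed loss function
theta = 0.9                # discount factor, value assumed

s_t = torch.rand(1, 12)    # placeholder for s_t
s_t2 = torch.rand(1, 12)   # placeholder for s_{t+2} (game mechanics omitted)
r_t = 1.0                  # placeholder reward

q_t = model(s_t)                  # prediction, attached to the graph
a_t = int(q_t.argmax(dim=1))      # greedy action

q_t2 = model(s_t2)                # tracked by autograd here on purpose

# Build the target from detached values only. clone() is needed because
# detach() shares memory with q_t and the target is modified in place.
target = q_t.detach().clone()
target[0, a_t] = r_t + theta * q_t2.detach().max()

loss = criterion(q_t, target)
loss.backward()                   # gradients flow only through q_t
```

Since the target differs from `q_t` only at index `a_t`, only that output contributes to the loss. Wrapping the later forward passes in `torch.no_grad()` on top of this would additionally save the memory for the unused graphs.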