How to do Q-learning with delayed rewards

I am making a game. In this game all agents interact with each other, and the next state and reward
can only be calculated after every agent's action has been determined.

I want to train the agents' actions with Q-learning.

The problem is that the reward can only be calculated after all agents' actions are determined.
And if I sum the rewards over all agents, the total is meaningless because it is close to 0.

Basically, Q-learning needs a reward and a next state for each step (in this case, for every unit).
But in my game model, these can only be calculated after the actions of all agents are determined.
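
To make the setup concrete, here is a minimal tabular sketch of the situation, not my actual game. It assumes a toy two-action game and a hypothetical `resolve_battle` function whose rewards sum to zero across units. Actions are chosen for every unit first, the joint result is computed once, and only then is the standard Q-learning update applied to each unit with its own reward and next state.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = [0, 1]  # toy action set (assumption): 0 = defend, 1 = attack

def choose_action(q, state, rng):
    # Epsilon-greedy choice from the current Q-table.
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def resolve_battle(states, actions):
    # Hypothetical joint step: needs ALL actions before it can run.
    # Rewards are centered, so they sum to (about) zero across units.
    attackers = sum(actions)
    mean = attackers / len(actions)
    rewards = [a - mean for a in actions]
    next_states = [attackers] * len(actions)
    return rewards, next_states

def q_update(q, s, a, r, s_next):
    # Standard one-step Q-learning update for a single stored transition.
    target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])

def play_round(q, states, rng):
    # Phase 1: pick an action for every unit, no learning yet.
    actions = [choose_action(q, s, rng) for s in states]
    # Phase 2: resolve the round once all actions are known.
    rewards, next_states = resolve_battle(states, actions)
    # Phase 3: update each unit from its stored (s, a) and its own reward.
    for s, a, r, s2 in zip(states, actions, rewards, next_states):
        q_update(q, s, a, r, s2)
    return actions, rewards, next_states
```

Each unit's update uses only that unit's reward, so the fact that the rewards sum to zero across units does not enter the update itself.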

How can I deal with this situation?

Is it really impossible to call backward() several times after several forward passes in PyTorch?

I tried the following. I get an action from my model with gradients disabled (no grad) and store the state and action. I do this for every unit (the battle result can only be calculated after all units' actions are determined).

Then, after I calculate the result using the actions from the neural net, I get a reward for each unit.

Then, using the stored states and actions, I update my model weights with backward() once for every unit.

But here is the problem: because I update the model once per unit, after the first update the model produces a different action from the stored one for every unit except the first.

If I use gradient accumulation instead, the reward will be meaningless, because the sum of all rewards will be close to 0.
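
For concreteness, here is a minimal sketch of the store-then-update loop described above, written as one batched update instead of one backward() per unit. The assumptions are loud ones: `q_net` is a stand-in linear layer rather than my real model, and the random `rewards` and `next_states` are placeholders for the real battle result. In this sketch the rewards are never added together: each unit's squared TD error uses its own reward, and only the per-unit losses are averaged. The stored actions are used as indices (via gather) and are not re-sampled from the updated network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
STATE_DIM, N_ACTIONS, N_UNITS, GAMMA = 4, 3, 5, 0.99

q_net = nn.Linear(STATE_DIM, N_ACTIONS)  # stand-in for the real model (assumption)
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)

# Phase 1: choose an action for every unit without building a graph.
states = torch.randn(N_UNITS, STATE_DIM)  # placeholder unit states
with torch.no_grad():
    actions = q_net(states).argmax(dim=1)

# Phase 2: placeholder for the battle result (assumption).
# Rewards are centered so that they sum to roughly zero across units.
rewards = torch.randn(N_UNITS)
rewards = rewards - rewards.mean()
next_states = torch.randn(N_UNITS, STATE_DIM)

# Phase 3: recompute Q(s, a) WITH gradients from the stored states and actions.
q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    # One TD target per unit, each built from that unit's own reward.
    targets = rewards + GAMMA * q_net(next_states).max(dim=1).values

# Mean of per-unit squared TD errors; a single backward pass covers all units.
loss = nn.functional.mse_loss(q_taken, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because all per-unit Q-values are computed from the same weights before the single optimizer step, no unit is evaluated against a model that was already changed by another unit's update.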

How can I deal with this situation?