Having trouble training Actor-Critic (A2C), please help


I have been able to train a system using Q-Learning, and I now want to move it to the Actor-Critic (A2C) method. Please don’t ask me why; I have to make this move.

I am currently borrowing the implementation from

The problem is that I keep getting a success rate of about 50%, which is basically random behavior. My game has long episodes (50 steps). How should I debug this? Should I print out the reward, the value, or something else?
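To make the question concrete, this is the kind of per-episode breakdown I have in mind. It is only a minimal sketch in plain Python, not the implementation I borrowed: the function names `discounted_returns` and `a2c_diagnostics`, the `gamma` value, and the assumption that the loss is made of a policy term, a value term, and an entropy term are all mine.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def a2c_diagnostics(rewards, values, log_probs, entropies, gamma=0.99):
    """Split a single-episode A2C loss into its parts (plain floats, no autograd).

    All inputs are per-step lists of equal length. The names are hypothetical
    and only meant for logging, not for training.
    """
    n = len(rewards)
    returns = discounted_returns(rewards, gamma)
    # Advantage estimate: Monte Carlo return minus the critic's prediction.
    advantages = [g - v for g, v in zip(returns, values)]

    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    value_loss = sum(a * a for a in advantages) / n
    entropy = sum(entropies) / n

    # Explained variance of the critic: 1 is perfect, 0 or less is uninformative.
    mean_g = sum(returns) / n
    var_g = sum((g - mean_g) ** 2 for g in returns) / n
    mean_a = sum(advantages) / n
    var_a = sum((a - mean_a) ** 2 for a in advantages) / n
    explained_variance = float("nan") if var_g == 0 else 1.0 - var_a / var_g

    return {
        "episode_return": returns[0],
        "policy_loss": policy_loss,
        "value_loss": value_loss,
        "entropy": entropy,
        "explained_variance": explained_variance,
    }


if __name__ == "__main__":
    # Toy 3-step episode with made-up numbers, just to show the output format.
    print(a2c_diagnostics([0, 0, 1], [0, 0, 0], [-1, -1, -1], [0.5, 0.5, 0.5], gamma=0.5))
```

Logging these separately per episode would at least show whether the total loss is dominated by the value term, and whether the policy entropy is collapsing or staying flat.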

Here is some log output:

simulation episode 2: Success, turn_count =20
loss = tensor(1763.7875)

simulation episode 3: Fail,  turn_count= 42
loss = tensor(44.6923)

simulation episode 4: Fail,  turn_count= 42
loss = tensor(173.5872)

simulation episode 5: Fail,  turn_count= 42
loss = tensor(4034.0889)

simulation episode 6: Fail,  turn_count= 42
loss = tensor(132.7567)

simulation episode 7: Success, turn_count =22
loss = tensor(2099.5344)

As a general trend, I have observed that for Success episodes the loss tends to be huge, whereas for Fail episodes the loss tends to be small. Any suggestions?
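One check I could run is whether this difference is simply the scale of the returns rather than anything about learning. The sketch below is an assumption-heavy illustration: the reward schemes (a large terminal reward on success, a small terminal penalty on failure), the critic that always predicts 0, and the helper names are all hypothetical and not taken from my actual environment or the borrowed code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def value_loss(returns, values):
    """Mean squared error between returns and critic predictions."""
    return sum((g - v) ** 2 for g, v in zip(returns, values)) / len(returns)


def standardize(xs, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]


# HYPOTHETICAL reward schemes, chosen only to illustrate the scale effect.
success_rewards = [0.0] * 19 + [50.0]   # 20 steps, big reward at the end
fail_rewards = [0.0] * 41 + [-1.0]      # 42 steps, small penalty at the end

success_returns = discounted_returns(success_rewards)
fail_returns = discounted_returns(fail_rewards)

# An untrained critic is assumed to predict roughly 0 everywhere.
raw_success = value_loss(success_returns, [0.0] * len(success_returns))
raw_fail = value_loss(fail_returns, [0.0] * len(fail_returns))

# The same comparison after standardizing the returns within each episode.
norm_success = value_loss(standardize(success_returns), [0.0] * len(success_returns))
norm_fail = value_loss(standardize(fail_returns), [0.0] * len(fail_returns))

print("raw value loss:          success =", raw_success, " fail =", raw_fail)
print("standardized value loss: success =", norm_success, " fail =", norm_fail)
```

If my real numbers behave like this toy case, the large loss on Success episodes would mostly reflect large return targets, and the raw loss value by itself would not say much about whether training is progressing.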