Loss not converging in DDPG

As training progresses, the actor_loss and the critic_loss increase rather than decrease.
Here are my actor net and critic net, with HIDDEN_SIZE_1=512, HIDDEN_SIZE_2=1024, HIDDEN_SIZE_3=512, HIDDEN_SIZE_4=256:

class ActorNet(nn.Module):
    def __init__(self):
        super(ActorNet, self).__init__()
        self.input_size = 10 + K * 2
        self.output_size = 1 + 1
        self.fc1 = nn.Linear(self.input_size, HIDDEN_SIZE_1)
        self.fc2 = nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.fc3 = nn.Linear(HIDDEN_SIZE_2, HIDDEN_SIZE_3)
        self.fc4 = nn.Linear(HIDDEN_SIZE_3, HIDDEN_SIZE_4)
        self.fc5 = nn.Linear(HIDDEN_SIZE_4, self.output_size)
        # init weight
        nn.init.xavier_normal_(self.fc1.weight)
        nn.init.constant_(self.fc1.bias, 0)
        nn.init.xavier_normal_(self.fc2.weight)
        nn.init.constant_(self.fc2.bias, 0)
        nn.init.xavier_normal_(self.fc3.weight)
        nn.init.constant_(self.fc3.bias, 0)
        nn.init.xavier_normal_(self.fc4.weight)
        nn.init.constant_(self.fc4.bias, 0)
        nn.init.xavier_normal_(self.fc5.weight)
        nn.init.constant_(self.fc5.bias, 0)


    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = torch.relu(x)
        x = self.fc3(x)
        x = torch.relu(x)
        x = self.fc4(x)
        x = torch.relu(x)
        x = self.fc5(x)
        # bound the two action dimensions: sigmoid on the first, tanh on the second
        # (applied in place, and only to the first row of the batch)
        x[0][0] = torch.sigmoid(x[0][0])
        x[0][1] = torch.tanh(x[0][1])
        return x

class CriticNet(nn.Module):
    def __init__(self):
        super(CriticNet, self).__init__()
        self.input_size = 10 + K * 2 + 2
        self.output_size = 1
        self.fc1 = nn.Linear(self.input_size, HIDDEN_SIZE_1)
        self.fc2 = nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.fc3 = nn.Linear(HIDDEN_SIZE_2, HIDDEN_SIZE_3)
        self.fc4 = nn.Linear(HIDDEN_SIZE_3, HIDDEN_SIZE_4)
        self.fc5 = nn.Linear(HIDDEN_SIZE_4, self.output_size)
        # init weight
        nn.init.xavier_normal_(self.fc1.weight)
        nn.init.constant_(self.fc1.bias, 0)
        nn.init.xavier_normal_(self.fc2.weight)
        nn.init.constant_(self.fc2.bias, 0)
        nn.init.xavier_normal_(self.fc3.weight)
        nn.init.constant_(self.fc3.bias, 0)
        nn.init.xavier_normal_(self.fc4.weight)
        nn.init.constant_(self.fc4.bias, 0)
        nn.init.xavier_normal_(self.fc5.weight)
        nn.init.constant_(self.fc5.bias, 0)

    def forward(self, state, action):
        x = torch.cat([state, action], 1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = torch.relu(x)
        x = self.fc3(x)
        x = torch.relu(x)
        x = self.fc4(x)
        x = torch.relu(x)
        x = self.fc5(x)
        return x

and the actor loss and the critic loss are defined by:

        policy_Q = self.critic(state_batch, self.actor(state_batch))
        actor_loss = -policy_Q.mean()

        next_action_batch = self.target_actor(next_state_batch)
        target_Q = self.target_critic(next_state_batch,next_action_batch.detach())
        label_Q = reward_batch + GAMMA * target_Q
        policy_Q_ = self.critic(state_batch, action_batch)
        #critic_loss = ((label_Q - policy_Q_) ** 2).mean()
        critic_loss = self.value_criterion(label_Q, policy_Q_.detach())

where value_criterion is the MSE loss.

My learning rate is set to 1e-5, and I use a soft update with tau = 0.005 to update the target networks.
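For comparison, here is my understanding of the standard DDPG update, as a minimal sketch (self.critic_optim, self.actor_optim, and TAU are illustrative names, not my exact training loop):

    # compute the bootstrap target without tracking gradients
    with torch.no_grad():
        next_action_batch = self.target_actor(next_state_batch)
        target_Q = self.target_critic(next_state_batch, next_action_batch)
        label_Q = reward_batch + GAMMA * target_Q

    # critic update: keep the gradient on the current Q estimate
    policy_Q_ = self.critic(state_batch, action_batch)
    critic_loss = self.value_criterion(policy_Q_, label_Q)
    self.critic_optim.zero_grad()
    critic_loss.backward()
    self.critic_optim.step()

    # actor update: maximize Q under the current policy
    actor_loss = -self.critic(state_batch, self.actor(state_batch)).mean()
    self.actor_optim.zero_grad()
    actor_loss.backward()
    self.actor_optim.step()

    # soft update with TAU = 0.005: theta_target <- TAU * theta + (1 - TAU) * theta_target
    for net, target in ((self.critic, self.target_critic), (self.actor, self.target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.copy_(TAU * p.data + (1 - TAU) * tp.data)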

It may not be an issue; the only relevant metric in RL is the reward.
Let me explain: say your rewards are between -2 and -1. The value (discounted return) of an optimal policy will be negative, and that of a random policy will be lower still.
At initialization your value net will output roughly zero, but as learning progresses the predicted value will get lower (and so will the actor gain, i.e. the actor loss will get bigger).
Similar reasoning shows why the value loss may also get bigger.
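For example, with a purely illustrative gamma of 0.99:

    # per-step reward bounded in [-2, -1], discount gamma = 0.99
    gamma = 0.99
    r_min, r_max = -2.0, -1.0
    v_min = r_min / (1 - gamma)   # -200.0: lower bound on the infinite-horizon discounted return
    v_max = r_max / (1 - gamma)   # -100.0: upper bound
    # the critic starts out predicting ~0 and drifts toward these large negative values,
    # so actor_loss = -Q.mean() grows even while the policy is improving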


Thanks so much for your reply!
As training progresses, the rewards don't seem to change; it looks like my agent isn't learning anything during training.
Also, after training I run a test function to check the trained policy, and the results look terrible. What might cause this, and how can I track down and fix the problem?
Thanks again for your reply!

You can always check torchrl’s knowledge base
http://pytorch.org/rl/reference/generated/knowledge_base/DEBUGGING_RL.html


Thanks so much for your help!
I will read it and get back to my code. Thanks again!

Hi @vmoens, I found that my problem is that the output of the actor network is constrained to a threshold value, which causes the gradient to vanish.
How can I deal with this?
I have changed the actor loss function to:

    actor_loss = (-policy_Q.mean()
                  + kappa_v * (torch.pow(torch.max(penalty[0] - zeta_s, 0)[0], 2)
                               + torch.pow(torch.max(-penalty[0] - zeta_s, 0)[0], 2))
                  + kappa_a * (torch.pow(torch.max(penalty[1] - zeta_t, 0)[0], 2)
                               + torch.pow(torch.max(-penalty[1] - zeta_t, 0)[0], 2)))

It uses a pre-activation penalty to avoid the vanishing gradient, but it does not work: the gradient still vanishes after a few epochs.
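For reference, a clamp-based way to write the intended hinge penalty (a sketch assuming penalty[0] and penalty[1] are tensors holding the two pre-activation outputs and that zeta_s, zeta_t are non-negative thresholds; note that torch.max(tensor, 0) treats the 0 as a dimension, not as a lower bound):

    # element-wise hinge on how far each pre-activation exceeds its threshold
    over_s = torch.clamp(penalty[0].abs() - zeta_s, min=0)
    over_t = torch.clamp(penalty[1].abs() - zeta_t, min=0)
    actor_loss = -policy_Q.mean() + kappa_v * over_s.pow(2).sum() + kappa_a * over_t.pow(2).sum()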
What should I do?
Hoping for your reply.

Improper initialization can result in vanishing/exploding gradients.

  1. Layers where you're using ReLU activation should use Kaiming initialization (see the sketch after this list), or keep Xavier and change those activations to Tanh.
  2. You have both a Sigmoid and a Tanh activation on the final layer. You could probably get by with one or the other, or just remove the final activation.
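For point 1, a minimal sketch (the init_weights helper below is just illustrative):

    # Kaiming for the ReLU hidden layers, Xavier (or the default) for the output layer
    def init_weights(layer, final=False):
        if final:
            nn.init.xavier_normal_(layer.weight)
        else:
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.constant_(layer.bias, 0)

    # e.g. in ActorNet.__init__:
    # for fc in (self.fc1, self.fc2, self.fc3, self.fc4):
    #     init_weights(fc)
    # init_weights(self.fc5, final=True)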

Hi @J_Johnson, thanks for your advice!
I have changed the initialization to Kaiming, and I also removed the tanh, so there is only the sigmoid now.
But the gradient still seems to vanish.
I have checked the output of the actor net before the sigmoid. After a few episodes it grows to about 27, which I think is wrong. How should I fix this?
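For reference, checking the sigmoid gradient at that pre-activation confirms it has saturated:

    x = torch.tensor(27.0, requires_grad=True)
    y = torch.sigmoid(x)
    y.backward()
    print(y.item())       # ~1.0: the sigmoid is saturated
    print(x.grad.item())  # ~1.9e-12: essentially no gradient flows back through it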
Could the reward function be steering the actor net's training in the wrong direction?
Hoping for your reply!