# Value loss not converging in Actor Critic

Hi, I am working on an Actor Critic. My setup is as follows: at each step of an episode I calculate the policy loss and the value loss and store them in two separate lists. After the episode ends, I `torch.stack` each list and take its mean to get the Policy Loss and the Value Loss for the entire episode. Here is how I calculate the policy and value losses at each step.
The policy loss for each step is calculated as in Equation 5 of 1707.06347.pdf (arxiv.org). The value loss is:

``````
# TD target for this step
Vt = gamma*next_Value + reward
# per-step value loss against the critic's estimate
Valueloss = F.smooth_l1_loss(Vt, Value)
``````
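For completeness, here is the same per-step computation as a self-contained snippet; the tensor values and gamma below are just placeholders, not my actual numbers:

``````
import torch
import torch.nn.functional as F

gamma = 0.99                                   # placeholder discount factor
reward = torch.tensor(1.0)                     # placeholder reward at this step
next_Value = torch.tensor(0.7)                 # critic's estimate V(s_{t+1})
Value = torch.tensor(0.5, requires_grad=True)  # critic's estimate V(s_t)

Vt = gamma*next_Value + reward                 # TD target: r + gamma * V(s_{t+1})
Valueloss = F.smooth_l1_loss(Vt, Value)        # Huber-style loss between target and estimate
``````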

After the end of the episode, I calculate the Policy Loss and Value Loss for the episode as described above. The final loss is:

``````
Loss = Policy_loss + alpha*Value_loss
``````

where alpha is a non-trainable scaling factor that I use to keep the two losses at comparable magnitudes.
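Just to illustrate what I mean by that scaling (this is only a sketch, not my exact code):

``````
# illustrative sketch only: scale the value loss so its magnitude roughly
# matches the policy loss; detach() keeps alpha out of the autograd graph
alpha = Policy_loss.detach().abs() / (Value_loss.detach().abs() + 1e-8)
Loss = Policy_loss + alpha*Value_loss
``````
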
When I print the losses, I see something like this:

``````
Policy_loss = tensor(36.1547, dtype=torch.float64, grad_fn=<MeanBackward0>)
``````

I backpropagate through Loss (Loss = Policy_loss + alpha*Value_loss). Over many runs, Policy_loss and Loss both decrease, but Value_loss only fluctuates and does not decrease. I am sure this is affecting the results.
Can anyone help?

This is how I calculate state_values:

``````
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size*2, 64)

        self.fc2 = nn.Linear(64, 32)
        self.layer_out = nn.Sequential(nn.Linear(32, 1))

        # actor's layer

        # critic's layer
        self.value_head = nn.Sequential(nn.Linear(state_size*2, state_size), nn.ReLU(), nn.Linear(state_size, 1), nn.ReLU())

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.layer_out(x)
        x = torch.flatten(x)

        .........
        return mu, sigma, state_values

#in main.py
#within the episode loop
Vt = gamma*next_Value.to(device) + reward.to(device)
Lv = Vt - Value    #or, F.smooth_l1_loss(Vt, Value)
value_losses.append(Lv)
#outside the episode loop
loss = torch.stack(policy_losses).mean().to(device) + torch.stack(value_losses).mean().to(device)
loss.backward()
opt.step()
``````

I checked that even if I just call `torch.stack(value_losses).mean().backward()`, i.e. backpropagate only through the value loss, it still does not decrease.
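Concretely, that isolated check looks roughly like this (a sketch; `opt` is the same optimizer as in the snippet above):

``````
# update using only the value loss, to see whether it decreases on its own
opt.zero_grad()
value_loss_only = torch.stack(value_losses).mean()
value_loss_only.backward()
opt.step()
``````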

When I check if the model is getting updated at all using:

``````
def test_all_parameters_updated_for_specific_model(self):
    output1, output2, output3 = model(self.private_input, self.state_input)
    loss = output1 + output2 + output3
    loss.backward()
    optimizer.step()

    #for param in model.parameters():
    #    print(type(param.data), param.size())

    for param_name, param in model.named_parameters():