Hi, I am working on an actor-critic model. For each step of an episode, I calculate the policy loss and the value loss and store them in two separate lists. After the episode ends, I call `torch.stack` on each list and take the mean to get the policy loss and the value loss for the entire episode. Here is how I calculate the policy and value losses for each step.

The policy loss for each step is given by Equation 5 of 1707.06347.pdf (arxiv.org). The value loss is:

```
Vt = gamma * next_Value + reward
Valueloss = F.smooth_l1_loss(Vt, Value)
```
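For reference, here is a minimal sketch of one common way to write this per-step value loss in PyTorch. It makes two assumptions that are not in the snippet above: the TD target is detached so that gradients flow only through the current value estimate, and the prediction is passed as the first argument of `F.smooth_l1_loss`. The function name `step_value_loss` and all numbers are made up for illustration.

```python
import torch
import torch.nn.functional as F


def step_value_loss(value, next_value, reward, gamma=0.99):
    """One-step TD value loss (gamma=0.99 is an assumed discount factor)."""
    # Assumption: treat the TD target as a constant, so no gradient
    # flows back through next_value.
    target = (reward + gamma * next_value).detach()
    # smooth_l1_loss takes (input, target): prediction first, target second.
    return F.smooth_l1_loss(value, target)


# Made-up values for a single step.
value = torch.tensor([1.0], requires_grad=True)
next_value = torch.tensor([2.0], requires_grad=True)
reward = torch.tensor([0.5])

loss = step_value_loss(value, next_value, reward)
loss.backward()
print(loss.item(), value.grad, next_value.grad)
```

With these numbers the target is 0.5 + 0.99 * 2.0 = 2.48, so only `value` receives a gradient; `next_value.grad` stays `None` because the target is detached.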

After the end of the episode, I calculate the Policy Loss and Value Loss for the episode as described above. The final loss is:

```
Loss = Policy_loss + alpha*Value_loss
```

where alpha is a non-trainable parameter that I use to keep the two losses at similar magnitudes.

When I print the losses, I see something like this:
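As a small illustration of the aggregation and weighting described above, the sketch below stacks per-step scalar losses, takes their means, and combines them with a fixed weight. The per-step values and `alpha = 0.5` are made-up placeholders, not numbers from my runs.

```python
import torch

alpha = 0.5  # hypothetical fixed (non-trainable) weight

# Made-up per-step losses collected during one episode.
policy_losses = [torch.tensor(1.0, requires_grad=True),
                 torch.tensor(3.0, requires_grad=True)]
value_losses = [torch.tensor(2.0, requires_grad=True),
                torch.tensor(6.0, requires_grad=True)]

# Episode-level losses: stack the per-step scalars, then average.
policy_loss = torch.stack(policy_losses).mean()
value_loss = torch.stack(value_losses).mean()

# Weighted sum that is backpropagated.
loss = policy_loss + alpha * value_loss
loss.backward()
print(policy_loss.item(), value_loss.item(), loss.item())
```

Because `alpha` is a plain Python float, it only rescales the value-loss gradient and is never updated by the optimizer.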

```
Policy_loss = tensor(36.1547, dtype=torch.float64, grad_fn=<MeanBackward0>)
Value_loss = tensor(85.1103, grad_fn=<MulBackward0>)
```

I backpropagate through Loss (Loss = Policy_loss + alpha*Value_loss). After many runs, I see that Policy_loss and Loss are both decreasing, but Value_loss fluctuates and does not decrease. I am sure this is affecting the result.

Can anyone help?

This is how I calculate state_values:

```
class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size*2, 64)
        self.fc2 = nn.Linear(64, 32)
        self.layer_out = nn.Sequential(nn.Linear(32, 1))
        # actor's layer
        self.action_head = nn.Linear(state_size*2, 1)
        self.muHead = nn.Sigmoid()
        self.varHead = nn.Softplus()
        # critic's layer
        self.value_head = nn.Sequential(nn.Linear(state_size*2, state_size), nn.ReLU(), nn.Linear(state_size, 1), nn.ReLU())

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.layer_out(x)
        x = torch.flatten(x)
        .........
        state_values = self.value_head(x)
        return mu, sigma, state_values

# in main.py
# within the episode loop
Vt = gamma*next_Value.to(device) + reward.to(device)
Lv = Vt - Value  # or, F.smooth_l1_loss(Vt, Value)
value_losses.append(Lv)
# outside the episode loop
opt.zero_grad()
loss = torch.stack(policy_losses).mean().to(device) + torch.stack(value_losses).mean().to(device)
loss.backward()
opt.step()
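To make the difference between the two variants on the `Lv` line concrete, here is a small sketch with made-up numbers. The raw difference `Vt - Value` is signed, so per-step errors can cancel when averaged, while `F.smooth_l1_loss` is non-negative for every step.

```python
import torch
import torch.nn.functional as F

# Made-up targets and predictions for two steps.
Vt = torch.tensor([0.0, 4.0])
Value = torch.tensor([2.0, 2.0])

# Variant 1: signed difference; the errors -2 and +2 cancel in the mean.
signed_mean = (Vt - Value).mean()

# Variant 2: smooth L1; each step contributes |2| - 0.5 = 1.5.
huber_mean = F.smooth_l1_loss(Value, Vt)

print(signed_mean.item(), huber_mean.item())
```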

I checked that even if I call torch.stack(value_losses).mean().backward(), that is, backpropagate only the value loss, it still does not decrease.

When I check if the model is getting updated at all using:

```
def test_all_parameters_updated_for_specific_model(self):
    self.private_input.requires_grad = True
    self.state_input.requires_grad = True
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, eps=1e-8, weight_decay=1e-5, amsgrad=True)
    optimizer.zero_grad()
    output1, output2, output3 = model(self.private_input, self.state_input)
    loss = output1 + output2 + output3
    loss.backward()
    optimizer.step()
    #for param in model.parameters():
    #    print(type(param.data), param.size())
    for param_name, param in model.named_parameters():
        if param.requires_grad:
            with self.subTest(name=param_name):
                self.assertIsNotNone(param.grad)
                self.assertNotEqual(0., torch.sum(param.grad ** 2).item())
                print(param_name, torch.sum(param.grad ** 2).item())

I see that yes, all parameters, including the value heads, are updating.
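The same kind of check can also be run against the value loss alone, to see which parameters it actually reaches. Below is a minimal sketch. The `TwoHead` model and the helper `grad_sq_norms` are hypothetical stand-ins written for this example, not my real `Policy` class.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TwoHead(nn.Module):
    """Hypothetical toy model: shared trunk with actor and critic heads."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(4, 8)
        self.actor = nn.Linear(8, 1)
        self.critic = nn.Linear(8, 1)

    def forward(self, x):
        h = torch.tanh(self.trunk(x))
        return self.actor(h), self.critic(h)


def grad_sq_norms(model, loss):
    """Backpropagate `loss` and return the squared gradient norm per parameter."""
    model.zero_grad()
    loss.backward()
    return {name: 0.0 if p.grad is None else p.grad.pow(2).sum().item()
            for name, p in model.named_parameters()}


model = TwoHead()
_, value = model(torch.ones(1, 4))
value_loss = (value - 1.0).pow(2).mean()  # made-up target of 1.0

norms = grad_sq_norms(model, value_loss)
for name, g in norms.items():
    print(name, g)
```

In this toy setup the critic head and the shared trunk get non-zero gradients from the value loss, while the actor head gets none, which is what one would expect when only the value loss is backpropagated.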