Value loss not converging in Actor Critic

Hi, I am working on an Actor-Critic. My setup is as follows: at each step of an episode I calculate the policy loss and the value loss and store them in two separate lists. After the episode ends, I torch.stack each list and take the mean to get the Policy Loss and the Value Loss of the entire episode. Here is how I calculate the policy and value losses at each step.
The per-step policy loss is computed as in Equation (5) of the PPO paper, 1707.06347.pdf (arxiv.org). The per-step value loss is:

Vt = gamma * next_Value + reward
Value_loss = F.smooth_l1_loss(Vt, Value)
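
To make that concrete, here is a tiny self-contained version of the per-step value loss on dummy numbers (the values and shapes are made up, purely for illustration):

import torch
import torch.nn.functional as F

gamma = 0.99
reward = torch.tensor(1.0)                       # r_t
Value = torch.tensor(0.5, requires_grad=True)    # critic estimate V(s_t)
next_Value = torch.tensor(0.7)                   # critic estimate V(s_t+1)

# one-step TD target: Vt = r_t + gamma * V(s_t+1)
Vt = gamma * next_Value + reward

# smooth L1 (Huber) loss between the target and the current estimate
Value_loss = F.smooth_l1_loss(Vt, Value)
Value_loss.backward()
print(Value_loss.item(), Value.grad)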

After the episode ends, I compute the episode's Policy Loss and Value Loss as described above. The final loss is:

Loss = Policy_loss + alpha*Value_loss

where alpha is a non-trainable scalar that I use to keep the two losses on a similar scale.
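
Schematically, the per-episode update looks like the sketch below. It is self-contained, with a dummy parameter standing in for the network and a hand-picked alpha, just to show the stack/mean/weighting:

import torch

theta = torch.nn.Parameter(torch.tensor(1.0))    # stand-in for the network parameters
opt = torch.optim.Adam([theta], lr=1e-3)

# pretend these were collected step by step inside the episode loop
policy_losses = [theta * 2.0, theta * 3.0]
value_losses = [(theta - 1.5) ** 2, (theta - 0.5) ** 2]

alpha = 0.5    # fixed, non-trainable weight (illustrative value)

Policy_loss = torch.stack(policy_losses).mean()
Value_loss = torch.stack(value_losses).mean()
Loss = Policy_loss + alpha * Value_loss

opt.zero_grad()
Loss.backward()
opt.step()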
When I print the losses, I see something like this:

Policy_loss = tensor(36.1547, dtype=torch.float64, grad_fn=<MeanBackward0>)
Value_loss = tensor(85.1103, grad_fn=<MulBackward0>)

I then backpropagate through this combined Loss (Loss = Policy_loss + alpha*Value_loss). Over many runs, Policy_loss and the total Loss decrease, but Value_loss only fluctuates and never decreases. I am sure this is affecting the results.
Can anyone help?

This is how I calculate state_values:

class Policy(nn.Module):
    """
    Implements both actor and critic in one model.
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size*2, 64)
        self.fc2 = nn.Linear(64, 32)
        self.layer_out = nn.Sequential(nn.Linear(32, 1))

        # actor's layers
        self.action_head = nn.Linear(state_size*2, 1)
        self.muHead = nn.Sigmoid()
        self.varHead = nn.Softplus()

        # critic's layer (ends with a ReLU on the value output)
        self.value_head = nn.Sequential(
            nn.Linear(state_size*2, state_size),
            nn.ReLU(),
            nn.Linear(state_size, 1),
            nn.ReLU()
        )

    def forward(self, x):
        """
        Forward pass of both actor and critic.
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.layer_out(x)
        x = torch.flatten(x)

        .........
        state_values = self.value_head(x)
        return mu, sigma, state_values
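
For reference, here is a minimal sketch of how outputs like these are typically consumed per step in a Gaussian-policy actor-critic (the elided part of forward produces mu and sigma via muHead/varHead; the numbers below are dummies, just to show the flow):

import torch
from torch.distributions import Normal

# dummy forward outputs; in the real code they come from model(state)
mu = torch.tensor(0.3, requires_grad=True)       # e.g. from muHead (Sigmoid)
sigma = torch.tensor(0.1, requires_grad=True)    # e.g. from varHead (Softplus)
state_value = torch.tensor(0.8, requires_grad=True)

dist = Normal(mu, sigma)
action = dist.sample()              # action sent to the environment
log_prob = dist.log_prob(action)    # feeds into the per-step policy loss
# state_value is what gets compared against the TD target Vt in the value loss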

# in main.py
# within the episode loop
    Vt = gamma * next_Value.to(device) + reward.to(device)
    Lv = Vt - Value    # or: F.smooth_l1_loss(Vt, Value)
    value_losses.append(Lv)

# outside the episode loop
opt.zero_grad()
loss = torch.stack(policy_losses).mean().to(device) + torch.stack(value_losses).mean().to(device)
loss.backward()
opt.step()

I checked that even if I just call torch.stack(value_losses).mean().backward(), i.e. backpropagate only the value loss, it still does not decrease.

When I check whether the model is being updated at all, using:

def test_all_parameters_updated_for_specific_model(self):
        self.private_input.requires_grad = True
        self.state_input.requires_grad = True
        optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, eps=1e-8, weight_decay=1e-5, amsgrad=True)
        optimizer.zero_grad()
        output1, output2, output3 = model(self.private_input, self.state_input)
        loss = output1 + output2 + output3
        loss.backward()
        optimizer.step()

        #for param in model.parameters():
        #    print(type(param.data), param.size())
        
        for param_name, param in model.named_parameters():
            if param.requires_grad:
                with self.subTest(name=param_name):
                    self.assertIsNotNone(param.grad)
                    self.assertNotEqual(0., torch.sum(param.grad ** 2).item())
                    print(param_name,torch.sum(param.grad ** 2).item())
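
For completeness, here is a stripped-down, self-contained version of the same kind of gradient check, with a dummy two-input model and placeholder shapes standing in for my real inputs:

import unittest
import torch
import torch.nn as nn

class DummyTwoInputModel(nn.Module):
    """Stand-in for the real actor-critic: two inputs, three scalar outputs."""
    def __init__(self, private_size=4, state_size=6):
        super().__init__()
        self.priv = nn.Linear(private_size, 1)
        self.stat = nn.Linear(state_size, 1)
        self.value = nn.Linear(private_size + state_size, 1)

    def forward(self, p, s):
        joint = torch.cat([p, s], dim=-1)
        return self.priv(p).sum(), self.stat(s).sum(), self.value(joint).sum()

class GradientFlowTest(unittest.TestCase):
    def setUp(self):
        torch.manual_seed(0)
        self.model = DummyTwoInputModel()
        self.private_input = torch.randn(1, 4)
        self.state_input = torch.randn(1, 6)

    def test_all_parameters_receive_gradient(self):
        optimizer = torch.optim.Adam(self.model.parameters(), lr=0.0005)
        optimizer.zero_grad()
        out1, out2, out3 = self.model(self.private_input, self.state_input)
        (out1 + out2 + out3).backward()
        optimizer.step()
        # every trainable parameter should have received a non-zero gradient
        for name, param in self.model.named_parameters():
            with self.subTest(name=name):
                self.assertIsNotNone(param.grad)
                self.assertNotEqual(0., torch.sum(param.grad ** 2).item())

if __name__ == "__main__":
    unittest.main()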

I can see that all parameters, including the value head's, are indeed being updated.