Custom Model with custom submodels in it not updating

Hi there,

I am building a reinforcement learning model that needs an RNN inside it. I wrote the code for the RNN like this:

class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob, state_size):
        super(LSTMModel, self).__init__()

        # Defining the number of layers and the nodes in each layer
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
    def forward(self):
        return ....

rnn = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim, dropout, state_size)

And I wrote the policy like this:

class Policy(nn.Module):
    """implements both actor and critic in one model"""
    def __init__(self):
        super(Policy, self).__init__()

    def forward(self):
        ...

policy = Policy()

Both of these are used in the model like this:

class Model(torch.nn.Module):
    def __init__(self, ann_private, rnn, policy):
        super(Model, self).__init__()
        self.ann_private = ann_private
        self.rnn = rnn
        self.policy = policy

    def forward(self, private_input, state):
        ...

model = Model(ann_private, rnn, policy)

The optimizer is written as optimizer = optim.Adam(model.parameters(), lr=1e-3)

The loss is calculated in a traditional actor-critic way. In each ‘subepisode’, we compute the states, actions, rewards, and next states, and store them in a memory. Later, we randomly draw a sample from the memory and calculate two different losses, Loss_policy and Loss_value. We use Variable(…, requires_grad = True) on the calculated losses. Then the code is:

loss = Lp + alpha*Lv

But, after each ‘subepisode’, I run

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)

to print the trainable parameters. I find all the params that I want, but I am seeing that none of them is changing. That means the model, along with the rnn and policy, is not updating. I don’t know what’s going on.
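A quick way to make this kind of check concrete is to snapshot the parameters before an optimizer step and compare them afterwards. A minimal sketch with a toy model standing in for the one above (the names here are illustrative, not the original code):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy stand-ins for the model and loss in the post
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Snapshot every trainable parameter before the update
before = {name: p.detach().clone()
          for name, p in model.named_parameters() if p.requires_grad}

loss = model(torch.randn(8, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Report which parameters actually moved
for name, p in model.named_parameters():
    changed = not torch.equal(before[name], p.detach())
    print(name, "changed" if changed else "UNCHANGED")
```

If a parameter prints UNCHANGED here, either its gradient never arrived or the update was numerically zero.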

This sounds wrong: by rewrapping the loss tensors you are detaching them from the computation graph, so backward() never reaches the model’s parameters. Variables have also been deprecated since PyTorch 0.4. Remove that line of code and rerun your training.
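The detaching effect can be reproduced in isolation: wrapping an already-computed loss in a fresh tensor with requires_grad=True creates a new leaf with no connection to the model, so backward() on it never reaches the parameters (a minimal sketch, not the code from the post):

```python
import torch

model = torch.nn.Linear(3, 1)
loss = model(torch.randn(5, 3)).sum()

# Rewrapping copies the value but cuts the graph: the new tensor is a leaf
rewrapped = torch.tensor(loss.item(), requires_grad=True)
rewrapped.backward()
print([p.grad for p in model.parameters()])  # [None, None] -> the model never sees a gradient

# Backward on the original, still-attached loss populates the grads
loss.backward()
print([p.grad is not None for p in model.parameters()])  # [True, True]
```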

Okay… Thanks, this sounds correct.
So, I did that. The complete policy code is this:

class Policy(nn.Module):
    """implements both actor and critic in one model"""
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size+1, 128)

        self.fc2 = nn.Linear(128, 64)

        # actor's layer
        self.action_head = nn.Linear(64, action_size)
        self.sigmoid = nn.Sigmoid()
        self.var = nn.Softplus()

        # critic's layer
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        """forward of both actor and critic"""
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # actor: chooses action to take from state s_t
        # by returning the parameters of the action distribution
        action_prob = self.action_head(x)
        mu = self.sigmoid(action_prob)
        var = self.var(action_prob)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        return mu, var, state_values
policy = Policy()

In the Model class, we call this policy after the rnn. As you can see, this is actor-critic. And in the agent class’s act method, we call the model to get the action like this:

    def act(self, some_input, state):
        mu, var, state_value = self.model(some_input, state)
        mu = mu.data.cpu().numpy()
        sigma = torch.sqrt(var).data.cpu().numpy()
        action = np.random.normal(mu, sigma)
        action = np.clip(action, 0, 1)
        action = torch.from_numpy(action/1000)
        return action, state_value

I must mention that the optimizer is called on model.parameters(). When we print all the trainable parameters in each epoch, we see that everything else is changing except for policy.action_head. Any idea why this is happening? I am not using Variable any more. I must also mention how the losses are calculated now:

advantage = reward - Value
Lp = -math.log(pdf_prob_now)*advantage
# similar for value_losses

# after all the runs in the epoch are done
loss = torch.stack(policy_losses).sum() + alpha*torch.stack(value_losses).sum()

Here Value is the state_value (the second output from agent.act), and pdf_prob_now is the probability of the chosen action among all possible actions, which is calculated like this:

def find_pdf(policy, action, rnn_output):
    mu, var, _ = policy(rnn_output)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    pdf_probability = stats.norm.pdf(action.cpu(), loc=mu, scale=sigma)
    return pdf_probability
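A general caveat about helpers like this: autograd only tracks torch operations, so anything computed after .data.cpu().numpy() (including stats.norm.pdf and math.log) is invisible to backward(). A minimal sketch of the difference, using plain math in place of scipy and torch.distributions.Normal as the torch-native comparison (toy values, not the original code):

```python
import math
import torch

mu = torch.tensor(0.5, requires_grad=True)
action = 0.3

# NumPy/scipy-style route: once the value leaves torch it is a plain float
# (this line stands in for stats.norm.pdf with scale=1), so backward() is impossible
pdf = math.exp(-(action - mu.item()) ** 2 / 2) / math.sqrt(2 * math.pi)
loss_np = -math.log(pdf) * 1.0  # just a Python float, no graph attached

# Torch-native route: the same log-probability stays on the autograd graph
dist = torch.distributions.Normal(mu, torch.tensor(1.0))
loss_t = -dist.log_prob(torch.tensor(action)) * 1.0
loss_t.backward()
print(mu.grad)  # a real gradient flows back into mu
```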

I don’t know how and when each code snippet is called, but check the .grad attribute of policy.action_head before and after the first backward call: make sure it is None first and shows a valid gradient afterwards (remove the zero_grad call if one was used before).
If you are seeing valid gradient values (even if they are close to zero), the gradients are indeed being calculated, but they might be too small to produce a visible change; in that case you should debug the operations used.
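The check described above can be sketched like this, with a toy layer and loss standing in for policy.action_head and the actual actor-critic loss:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in the real code this would be policy.action_head and the training loss
action_head = nn.Linear(64, 2)
loss = action_head(torch.randn(10, 64)).pow(2).mean()

print(action_head.weight.grad)              # None before the first backward
loss.backward()
print(action_head.weight.grad is not None)  # True: backward populated a gradient
print(action_head.weight.grad.abs().max())  # if this is ~0, updates will be too small to notice
```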